Research Question

What is AO3?

Archive of Our Own, or AO3, is a website that houses one of the largest amounts of fiction, specially it focuses on fanfiction. Fanfiction is a piece of fiction from usually pre-existing TV shows, books, or other types of media that is written by a fan. This works can twist the certain canon of the piece of media or it can take the characters and put them in a completely new situation.

Why AO3?

Physical book data can be hard to gather, especially trying to find accurate numbers since books can be bought or borrowed, but with AO3, since everything is digital, it is very easy to find the number of views a certain piece of fiction has or how well received it was (through likes). Although it is a fan site filled with fan-made works, it can still provide a lot of information for writers and companies for these different movies, TV series, and books by finding what the fans and audience tends to like and enjoy on this website. Along with AO3 being capable of tracking the number of views and likes on the fanfiction, the website also has a comprehensive tagging system that the fanfiction authors can use. This tagging system can help to find the main aspects of the work that draws readers to it with tags, such as ‘angst’, ‘fluff’, and ‘alternate universe’.

By using the AO3 data, different popular media companies and independent writers may be able to determine what ideas and concepts may aid in the process of determining how to increase or maximize the number of likes. One can look at the general rating (general audiences, teens and up, mature audiences, etc.), number of comments (to determine engagement), length (with the number of chapters or word count), and several other variables to determine how a book or movie can gain more views. Additionally, this can be beneficial to fanfiction authors and readers if they wish to find out what variables aid most in gaining the most views generally (across all movies, books, and TV shows) or specifically for one book or movie.

The Question

As stated in the previous section, there are several variables that AO3 tracks. Since AO3 contains copious amounts of fanfiction, this can help to determine how much fan interaction a certain idea or several ideas may have. With this data, I hope to learn what variables, such as number of comments, number of views, or combination of tags, help to best predict the number of likes, meaning I wish to measure the amount of positive fan interaction that a combination of variables may provide. This positive fan interaction can have insight into the next steps that a TV series, movie, or book can take to continue to have avid followers of the series and perhaps bring new viewers/readers into the media’s community. This leads into the overarching question: what are the variables from AO3 data that can help determine the amount of positive fan interaction?

The Data Set

Where is the data from?

I used an article from Medium by Sophia Z. to create a web scraping Python code, called “WebScraping.ipynb” in the GitHub, to get current data from the AO3 website. The code from Medium provided a skeleton, but I did modify the code to extract some more data. The data that was web scraped was from the AO3 website. The dataset includes all of the works that were updated from March 12, 2023 to March 20, 2023. I chose these dates since it will be current data and to limit the amount of data that will be in the dataset; there are a total of 40,037 records in the dataset. To access my specific data set, one can access the file here. Since the file is so large, it cannot be put onto GitHub normally.

The Variables

The dataset includes numerical, categorical, and other variables. Here are the numerical variables:

  • Word_count: This gives the current number of words in the fanfiction.
  • Num_chapters: The number of chapters in the fanfiction.
  • Num_comments: The total number of comments.
  • Num_kudos: The total number of likes (kudos).
  • Num_bookmarks: The number of people who publicly bookmarked the fanfiction.
  • Num_hits: The number of views (hits).

There are also quite a few categorical variables:

  • Complete: It states whether or not the fanfiction is complete
  • Warning: This indicates what type of severe content is present in the fanfiction (if any).
  • Pairing: This states whether there are any romantic pairings and what type.
    • ‘F/M’: there is a romantic relationship between a female and male character.
    • ‘M/M’: there is a romantic relationship between two male characters.
    • ‘F/F’: there is a romantic relationship between two female characters.
    • ‘Gen’: no romantic relationships or relationships are not the main focus of the work.
    • ‘Mutli’: contains more than one relationship.
    • ‘Other’: anything that does not fall into the previous categories.
  • Rating: This gives the general age range the audience should be to read the fanfiction.
    • ‘Explicit’: only suitable for adults
    • ‘Mature’
    • ‘Teens and Up Audiences’
    • ‘General Audiences’
    • ‘Not Rated’

Along with all of these variables, there are other variables as well:

  • Title: The title of the fanfiction.
  • Author: The author of the fanfiction.
  • ID: A unique identifier for the fanfiction.
  • Fandoms: The states the different books, movies, or TV shows that the fanfiction uses. This can contain one or more fandoms.
  • Date_updated: The day that the fanfiction that was updated.
  • Relationships: This states the different romantic or platonic relationships that is contained in the fanfiction.
  • Characters: This gives the names of all the characters.
  • Tags: Other content identifiers for the fanfiction.
  • Language: This gives the language of the fanfiction. (This only contains ‘English’ in this dataset.)

Type of Data

This data easily falls under the cross-sectional data umbrella. A cross-sectional data set consists of a sample of variables at a given point of time. Of course, the time is not necessarily correspond to the exact same time period, just similar times. My data set was accumulated over a nine day period, where each fanfiction is different – there are no duplicates. It cannot be time series because it does not follow the same variable or several variables over time (collecting data at each new observation), and it cannot be panel (or longitudinal data) because it follows multiple samples over time.

Level of Aggregation

The data consists of individual fanfiction data, where each row is a different fanfiction that was updated. It is not filtered by any specific show or rating. Though, the language data was filtered to consist only of fanfiction written in English.

Data Exploration

Loading the Data Set

First, I take the CSV file of the web scraped data from AO3 and load it into R. I also save a backup of the data set in case something goes wrong with the original data frame.

ao3 <- read.csv('March2023_AO3.csv')
# good practice in case something gets messed up in the code later
ao3_backup <- ao3

Variables

When we first upload the data frame, one should take a look at the data types of all of the original variables/columns, which can easily be done with the str() command.

str(ao3)
## 'data.frame':    40037 obs. of  19 variables:
##  $ Title        : chr  "Dark Matter" "Harry Potter and the Shadowed Light" "Sincerely, Deku." "Blood and Gold" ...
##  $ Author       : chr  "mysterycyclone" "Itshannieee" "Bakanohero" "ObsidianPen" ...
##  $ ID           : int  30153540 10404927 26920015 10643571 17636003 30222195 36553525 7361668 22428592 35597200 ...
##  $ Fandoms      : chr  "['The Avengers (Marvel Movies)', 'Spider-Man (Tom Holland Movies)', 'DCU (Comics)', 'Batman - All Media Types',"| __truncated__ "['Harry Potter - J. K. Rowling']" "['僕のヒーローアカデミア | Boku no Hero Academia | My Hero Academia']" "['Harry Potter - J. K. Rowling']" ...
##  $ Date_updated : chr  "16 Mar 2023" "16 Mar 2023" "19 Mar 2023" "16 Mar 2023" ...
##  $ Rating       : chr  "Not Rated" "Mature" "Teen And Up Audiences" "Explicit" ...
##  $ Pairing      : chr  "Gen" "M/M, F/M" "Gen" "F/M" ...
##  $ Warning      : chr  "Choose Not To Use Archive Warnings, Graphic Depictions Of Violence" "No Archive Warnings Apply" "No Archive Warnings Apply" "Choose Not To Use Archive Warnings" ...
##  $ Relationships: chr  "['Peter Parker & Tony Stark', 'Peter Parker & Avengers Team', 'Clark Kent & Peter Parker', 'Diana (Wonder Woman"| __truncated__ "['Harry Potter/Tom Riddle', 'Harry Potter/Voldemort', 'Harry Potter & Neville Longbottom', 'Sirius Black/Severu"| __truncated__ "['Aizawa Shouta | Eraserhead & Midoriya Izuku']" "['Hermione Granger/Tom Riddle', 'Hermione Granger/Voldemort']" ...
##  $ Characters   : chr  "['Peter Parker', 'Bruce Wayne', 'Jason Todd', 'Dick Grayson', 'Duke Thomas', 'Tim Drake']" "['Harry Potter', 'Tom Riddle | Voldemort', 'Kreacher (Harry Potter)', 'Ron Weasley', 'Ginny Weasley', 'Molly We"| __truncated__ "['Midoriya Izuku', 'Aizawa Shouta | Eraserhead', 'Sasaki Mirai | Sir Nighteye', 'Fukukado Emi | Ms. Joke', 'Mid"| __truncated__ "['Hermione Granger', 'Tom Riddle', 'Tom Riddle | Voldemort', 'Voldemort', 'Draco Malfoy', 'Abraxas Malfoy', 'He"| __truncated__ ...
##  $ Tags         : chr  "['Not Avengers: Endgame (Movie) Compliant', 'Post-Avengers: Infinity War Part 1 (Movie)', 'Soul Stone (Marvel)'"| __truncated__ "['Manipulative Dumbledore', 'Evil Dumbledore', 'Weasley Bashing', 'Dark Harry', 'Good Death Eaters', 'Good Vold"| __truncated__ "['Emails', 'email', 'Texting', 'Midoriya Izuku Does Not Have One for All Quirk', 'Quirkless Midoriya Izuku', 'Q"| __truncated__ "['Time Travel', 'HP: EWE', 'Post-War', \"Not a 'Hermione goes to school with Tom Riddle Fic'\", 'Dark', 'Psycho"| __truncated__ ...
##  $ Complete     : chr  "Work in Progress" "Work in Progress" "Work in Progress" "Work in Progress" ...
##  $ Language     : chr  "English" "English" "English" "English" ...
##  $ Word_count   : chr  "205,637" "270,566" "69,440" "215,087" ...
##  $ Num_chapters : int  42 46 112 39 78 57 37 70 12 38 ...
##  $ Num_comments : int  9839 4206 4701 4466 6298 827 1641 3540 4164 974 ...
##  $ Num_kudos    : int  30895 30219 19506 15042 13946 12857 11186 10150 9587 8270 ...
##  $ Num_bookmarks: int  8695 8995 4352 3672 3377 733 2877 1939 2734 2375 ...
##  $ Num_hits     : int  1010534 1039591 557501 469704 473485 390889 274057 282140 331209 219106 ...

All of the variables are one of two types currently: integer or string. There are several variables that can be converted to a factor type or different type, such as ‘Complete’ or ‘Rating’ because they consist of only a few different phrases. Therefore, I will change these variables into factors. The ‘Date_updated’ column can be converted into a date type for potential use later. Additionally, one can see that the ‘Word_count’ column is a string even though it should be a numeric value, so this values will be converted from a string into numeric.

Additionally, for the ‘Pairing’ column, if a record contained more than one pairing type, it lists all of the potential types. For example, a column can contain “‘F/M,M/M,Other” or “M/M,Gen,Multi”. To deal with all of these variations, any record that contains more than one pairing type will be replaced with ’Multi’, indicating that there are multiple different types.

# fix pairing column to have multi
ao3$Pairing <- gsub('.*,.*', 'Multi', ao3$Pairing)

# change date into Date type
ao3$Date_updated <- as.Date(ao3$Date_updated, format="%d %B %Y")
# change categorical variables like Rating and Complete into factors
ao3$Complete <- as.factor(ao3$Complete)
ao3$Rating <- as.factor(ao3$Rating)
ao3$Pairing <- as.factor(ao3$Pairing)
ao3$Warning <- as.factor(ao3$Warning)
# convert Word_count column into numeric
ao3$Word_count <- as.numeric(gsub(",","",ao3$Word_count))

After doing these conversions, one can look to see if there are any new columns that can be created that may be useful for future studying. For example, one can see if a fanfiction includes multiple different fandoms (or TV show, book, or movie communities) by checking the ‘Fandoms’ column to see if the list contains more than one element (if it does, it makes the piece of fiction a crossover.). Some other variables to consider is if the fanfiction contains any romantic or platonic relationships to see if these types of relationships can increase positive fan interaction. There are also several numerical columns that can be created to be considered such as the number of romantic or platonic relationships a fanfiction has along with the total number of characters (or number of prominent characters) in the fanfiction.

In AO3, with its tagging system, romantic relationships tend to be indicated with a “/” in between two characters’ names while platonic relationships are indicated with “&” between the names. So, to find out if there are any romantic or platonic relationships, one will just need to see if in the ‘Relationships’ column it contains “/” or “&”, and to find the number of each relationship type, one would just need to count the respective symbol. It will be more difficult to determine the number of characters since there is no special symbol associated with it, so I count the number of single quotes and divide it by two (with integer division).

# Create crossover (fanfiction from multiple fandoms) column
ao3$Is_crossover <- grepl(",",ao3$Fandoms)
# Create column if fanfiction contains romance
ao3$Has_romance <- grepl("/", ao3$Relationships)
# Create column if fanfiction contains platonic relationships
ao3$Has_platonic <- grepl("&", ao3$Relationships)

# use stringr library to count the occurrance of a pattern
library(stringr)
# Number of romantic relationships
ao3$Num_romance <- str_count(ao3$Relationships, "/")
# Number of platonic relationships
ao3$Num_platonic <- str_count(ao3$Relationships, "&")
# Number of characters in fanfiction
ao3$Num_characters <- str_count(ao3$Characters, "'") %/% 2

As discussed previously, AO3 has a very expansive and detailed tagging system for the fanfiction authors to use. The authors have a lot of independence in what they can place in the tags, so for the purpose of this project, I will look through all of the potential tags in the data set and determine the frequency of each tag. Before I can determine the frequency, I need to split all of the string-turned arrays to find all of the potential tags in the data set, and I do this by creating a function, called ‘listoflists’.

listoflists <- function(col) {
  a <- gsub('"',"'",col[1])
  a <- gsub("'\\]","",gsub("\\['","",a))
  b <- strsplit(a,"', '")
  list <- b
  for (i in 2:nrow(ao3)) { # nrow(ao3)
    a <- gsub('"',"'",col[i])
    a <- gsub("'\\]","",gsub("\\['","",a))
    b <- strsplit(a,"', '")
    list<-c(list,b)
  }
  result <- do.call(c,list)
  # use lowercase so that 'Cute' and 'cute' will be treated the same
  return(lapply(result,tolower)) 
}

This function takes each record, splits the string, does some string substitution, and takes the tags and put them into a large array. With this function, I can now find the frequency of all of the different tags in the data frame. Then, I use commands from the ‘vctrs’ library to determine the frequency of the tags.

# Frequency of Tags
l<-listoflists(ao3$Tags)
library(vctrs)
tags <- vec_count(l)

For the sake of this project, we will only take into account the top five tags in the list, which is listed below.

# Take top five tags
top_five <- tags[[1]][1:5]
for (i in top_five) {
  print(i)
}
## [1] "fluff"
## [1] "angst"
## [1] "hurt/comfort"
## [1] "alternate universe - canon divergence"
## [1] "fluff and angst"

The top five tags, as given above, is ‘fluff’, ‘angst’, ‘hurt/comfort’, ‘alternate universe - canon divergence’, and ‘fluff and angst’. These results are interesting since readers seem to want content that either happy and cute (fluff) or sad and depressing (angst) or a mixture of both (fluff and angst). Also, readers tend to want a story that is still similar to the overall story with only some changes – a divergence from the original canon (alternate universe - canon divergence). Readers also want stories that include some angst in some way, but it is immediately followed by comforting the character or making the situation better (hurt/comfort). With these top five tags, we can now create new columns with TRUE or FALSE values, indicating whether or not a tag is used in the fanfiction. Additionally, I will create a column that will be TRUE if the fanfiction contains any of the top five tags to see if that may have an overall impact in the data exploration or model.

# make tags column lowercase for easier comparison
ao3$Tags <- tolower(ao3$Tags)
# make new columns for top 5 tags
ao3$Fluff <- grepl(top_five[[1]], ao3$Tags)
ao3$Angst <- grepl(top_five[[2]], ao3$Tags)
ao3$Hurt_comfort <- grepl(top_five[[3]], ao3$Tags)
ao3$AU <- grepl(top_five[[4]], ao3$Tags)
ao3$Fluff_angst <- grepl(top_five[[5]], ao3$Tags)

# Create column that is TRUE when fanfiction contains any top five tag
ao3$Five_tags <- ao3$Fluff | ao3$Angst | ao3$Hurt_comfort | ao3$AU | ao3$Fluff_angst

Finally, we can check all of the new variables and data types using the str() command to make sure all of the column (old and new) are behaving properly.

# check new data types and variables
str(ao3)
## 'data.frame':    40037 obs. of  31 variables:
##  $ Title         : chr  "Dark Matter" "Harry Potter and the Shadowed Light" "Sincerely, Deku." "Blood and Gold" ...
##  $ Author        : chr  "mysterycyclone" "Itshannieee" "Bakanohero" "ObsidianPen" ...
##  $ ID            : int  30153540 10404927 26920015 10643571 17636003 30222195 36553525 7361668 22428592 35597200 ...
##  $ Fandoms       : chr  "['The Avengers (Marvel Movies)', 'Spider-Man (Tom Holland Movies)', 'DCU (Comics)', 'Batman - All Media Types',"| __truncated__ "['Harry Potter - J. K. Rowling']" "['僕のヒーローアカデミア | Boku no Hero Academia | My Hero Academia']" "['Harry Potter - J. K. Rowling']" ...
##  $ Date_updated  : Date, format: "2023-03-16" "2023-03-16" ...
##  $ Rating        : Factor w/ 5 levels "Explicit","General Audiences",..: 4 3 5 1 5 3 5 3 3 4 ...
##  $ Pairing       : Factor w/ 7 levels "F/F","F/M","Gen",..: 3 5 3 2 5 2 4 4 5 3 ...
##  $ Warning       : Factor w/ 120 levels "Choose Not To Use Archive Warnings",..: 2 81 81 1 33 1 18 33 1 81 ...
##  $ Relationships : chr  "['Peter Parker & Tony Stark', 'Peter Parker & Avengers Team', 'Clark Kent & Peter Parker', 'Diana (Wonder Woman"| __truncated__ "['Harry Potter/Tom Riddle', 'Harry Potter/Voldemort', 'Harry Potter & Neville Longbottom', 'Sirius Black/Severu"| __truncated__ "['Aizawa Shouta | Eraserhead & Midoriya Izuku']" "['Hermione Granger/Tom Riddle', 'Hermione Granger/Voldemort']" ...
##  $ Characters    : chr  "['Peter Parker', 'Bruce Wayne', 'Jason Todd', 'Dick Grayson', 'Duke Thomas', 'Tim Drake']" "['Harry Potter', 'Tom Riddle | Voldemort', 'Kreacher (Harry Potter)', 'Ron Weasley', 'Ginny Weasley', 'Molly We"| __truncated__ "['Midoriya Izuku', 'Aizawa Shouta | Eraserhead', 'Sasaki Mirai | Sir Nighteye', 'Fukukado Emi | Ms. Joke', 'Mid"| __truncated__ "['Hermione Granger', 'Tom Riddle', 'Tom Riddle | Voldemort', 'Voldemort', 'Draco Malfoy', 'Abraxas Malfoy', 'He"| __truncated__ ...
##  $ Tags          : chr  "['not avengers: endgame (movie) compliant', 'post-avengers: infinity war part 1 (movie)', 'soul stone (marvel)'"| __truncated__ "['manipulative dumbledore', 'evil dumbledore', 'weasley bashing', 'dark harry', 'good death eaters', 'good vold"| __truncated__ "['emails', 'email', 'texting', 'midoriya izuku does not have one for all quirk', 'quirkless midoriya izuku', 'q"| __truncated__ "['time travel', 'hp: ewe', 'post-war', \"not a 'hermione goes to school with tom riddle fic'\", 'dark', 'psycho"| __truncated__ ...
##  $ Complete      : Factor w/ 2 levels "Complete Work",..: 2 2 2 2 2 2 2 2 2 1 ...
##  $ Language      : chr  "English" "English" "English" "English" ...
##  $ Word_count    : num  205637 270566 69440 215087 114404 ...
##  $ Num_chapters  : int  42 46 112 39 78 57 37 70 12 38 ...
##  $ Num_comments  : int  9839 4206 4701 4466 6298 827 1641 3540 4164 974 ...
##  $ Num_kudos     : int  30895 30219 19506 15042 13946 12857 11186 10150 9587 8270 ...
##  $ Num_bookmarks : int  8695 8995 4352 3672 3377 733 2877 1939 2734 2375 ...
##  $ Num_hits      : int  1010534 1039591 557501 469704 473485 390889 274057 282140 331209 219106 ...
##  $ Is_crossover  : logi  TRUE FALSE FALSE FALSE TRUE FALSE ...
##  $ Has_romance   : logi  FALSE TRUE FALSE TRUE FALSE TRUE ...
##  $ Has_platonic  : logi  TRUE TRUE TRUE FALSE FALSE FALSE ...
##  $ Num_romance   : int  0 4 0 2 0 2 3 1 13 1 ...
##  $ Num_platonic  : int  9 3 1 0 0 0 4 0 0 3 ...
##  $ Num_characters: num  6 27 18 10 26 7 7 5 9 5 ...
##  $ Fluff         : logi  FALSE FALSE FALSE FALSE FALSE TRUE ...
##  $ Angst         : logi  TRUE FALSE TRUE FALSE FALSE TRUE ...
##  $ Hurt_comfort  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ AU            : logi  FALSE TRUE FALSE FALSE FALSE FALSE ...
##  $ Fluff_angst   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Five_tags     : logi  TRUE TRUE TRUE FALSE FALSE TRUE ...

Missing Values or NAs

After creating and changing all of the variables, one needs to see if there are any NA or missing values in the table.

# find NAs or NULLs in dataset
which(is.na(ao3))
## integer(0)

In this case, we do not have any missing values to worry about, so there is nothing to do there, but we may need to worry about rows that include zeros for the number of views or likes. If there are zeros in the target variable’s column, then it can make it difficult to calculate different values later on, such as MAPE, mean absolute percent error. Also, I am hoping to predict positive fan interaction, meaning I am hoping to predict likes above zero, and for the company or person this is presented to, they would be able to achieve at least one like for the idea.

# number of rows with 0 kudos
nrow(ao3[ao3$Num_kudos==0,])
## [1] 1906

With the number of records in the data set (40,037 total), one can easily just remove the observations without it impacting the results too heavily.

# remove columns with zero kudos
ao3 <- ao3[ao3$Num_kudos>0,]

Summary Statistics

Now, with the cleaned data set, we can look at the summary statistics to see if there is anything interesting happening with the data.

# show summary stats
summary(ao3)
##     Title              Author                ID             Fandoms         
##  Length:38131       Length:38131       Min.   :  290120   Length:38131      
##  Class :character   Class :character   1st Qu.:44890706   Class :character  
##  Mode  :character   Mode  :character   Median :45736477   Mode  :character  
##                                        Mean   :43919850                     
##                                        3rd Qu.:45804740                     
##                                        Max.   :46390825                     
##                                                                             
##   Date_updated                          Rating             Pairing     
##  Min.   :2023-03-12   Explicit             : 7877   F/F        : 2482  
##  1st Qu.:2023-03-14   General Audiences    : 8311   F/M        : 6697  
##  Median :2023-03-16   Mature               : 6531   Gen        : 5531  
##  Mean   :2023-03-16   Not Rated            : 4245   M/M        :12386  
##  3rd Qu.:2023-03-18   Teen And Up Audiences:11167   Multi      : 7130  
##  Max.   :2023-03-20                                 No category: 2949  
##                                                     Other      :  956  
##                                                           Warning     
##  No Archive Warnings Apply                                    :17705  
##  Choose Not To Use Archive Warnings                           :11261  
##  Graphic Depictions Of Violence                               : 2844  
##  Graphic Depictions Of Violence, Major Character Death        : 1194  
##  Major Character Death                                        : 1120  
##  Choose Not To Use Archive Warnings, No Archive Warnings Apply:  814  
##  (Other)                                                      : 3193  
##  Relationships       Characters            Tags          
##  Length:38131       Length:38131       Length:38131      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##              Complete       Language           Word_count     
##  Complete Work   :24045   Length:38131       Min.   :      0  
##  Work in Progress:14086   Class :character   1st Qu.:   1269  
##                           Mode  :character   Median :   3109  
##                                              Mean   :  14416  
##                                              3rd Qu.:   9806  
##                                              Max.   :3417557  
##                                                               
##   Num_chapters       Num_comments        Num_kudos       Num_bookmarks    
##  Min.   :   0.000   Min.   :    0.00   Min.   :    1.0   Min.   :   0.00  
##  1st Qu.:   1.000   1st Qu.:    1.00   1st Qu.:    9.0   1st Qu.:   1.00  
##  Median :   1.000   Median :    4.00   Median :   29.0   Median :   3.00  
##  Mean   :   4.705   Mean   :   32.42   Mean   :  112.3   Mean   :  18.62  
##  3rd Qu.:   4.000   3rd Qu.:   13.00   3rd Qu.:   90.0   3rd Qu.:  10.00  
##  Max.   :1715.000   Max.   :11426.00   Max.   :30895.0   Max.   :8995.00  
##                                                                           
##     Num_hits       Is_crossover    Has_romance     Has_platonic   
##  Min.   :      1   Mode :logical   Mode :logical   Mode :logical  
##  1st Qu.:    136   FALSE:28663     FALSE:8759      FALSE:26557    
##  Median :    418   TRUE :9468      TRUE :29372     TRUE :11574    
##  Mean   :   2506                                                  
##  3rd Qu.:   1335                                                  
##  Max.   :1039591                                                  
##                                                                   
##   Num_romance      Num_platonic     Num_characters     Fluff        
##  Min.   : 0.000   Min.   : 0.0000   Min.   : 0.000   Mode :logical  
##  1st Qu.: 1.000   1st Qu.: 0.0000   1st Qu.: 2.000   FALSE:28231    
##  Median : 1.000   Median : 0.0000   Median : 4.000   TRUE :9900     
##  Mean   : 1.445   Mean   : 0.8471   Mean   : 5.366                  
##  3rd Qu.: 2.000   3rd Qu.: 1.0000   3rd Qu.: 7.000                  
##  Max.   :47.000   Max.   :57.0000   Max.   :72.000                  
##                                                                     
##    Angst         Hurt_comfort        AU          Fluff_angst    
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:28108     FALSE:32278     FALSE:35418     FALSE:35871    
##  TRUE :10023     TRUE :5853      TRUE :2713      TRUE :2260     
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##  Five_tags      
##  Mode :logical  
##  FALSE:19830    
##  TRUE :18301    
##                 
##                 
##                 
## 

There are some interesting parts to look at in the summary statistics. A majority of the records in the data set contains at least one romantic relationship, where 77% (29,372 / (29,372 + 8,759)) of the fanfictions contain it, but with platonic relationships, this decreases to 30%. A majority of the stories are rated for teens rather than any of the other ratings, which makes sense because one would try to include as many age groups as possible for the story’s reach. Additionally, it seems that the average number of characters (at least prominent characters) in a story is five.

Exploring the Data Further

Given all of the variables in this data set, there are only a particular number of variables I wish to explore further: rating, pairing, number of comments, number of likes, number of views, whether or not it is a crossover, contains romance and number of relationships, contains platonic relationships and number of relationships, and number of characters.

Variables like ‘Title’, ‘Author’, ‘ID’, ‘Fandom’, and ‘Date_updated’ is not useful in the model I am creating since they are only needed for the website or if I were making a more specific model, such as a tailored model to determine the number of views for something in the Harry Potter franchise. While the ‘Warning’ variable can be useful in some cases, if we look at the summary results in the previous code block, a majority of the records either did not need to use warnings or chose not to use them, so the results can be very skewed. ‘Relationships’ and ‘Characters’ are only useful for more fandom-specific models since there are just so many different characters and potential relationships that can be listed (because of all of the unique names), which is why the broader variables, such as ‘Has_romance’ and ‘Num_characters’, exist. ‘Language’ is ignored since it only consists of one value: ‘English’. ‘Word_count’, ‘Num_chapters’, and ‘Complete’ can be useful in determining likes or views for books by predicting the optimal number of words and chapter for a book, but since this data set includes fan communities from books, TV shows, and movies, it does not give as much meaning in this case. If the model was more tailored to book-only fandoms, then these variables can be more helpful.

Numerical Variables

The numerical variables are ‘Num_comments’, ‘Num_hits’, ‘Num_bookmarks’, ‘Num_characters’, ‘Num_romance’, and ‘Num_platonic’. We can look at the different numerical variables with histograms and scatter plots. First, we can look at the histograms to see how frequently a value occurs in the data set for a particular variable.

### Histograms
# Number of comments
hist(ao3$"Num_comments")

# Number of view
hist(ao3$"Num_hits")

# Number of likes
hist(ao3$"Num_kudos")

# Number of bookmarks
hist(ao3$"Num_bookmarks")

# Number of characters
hist(ao3$"Num_characters")

# Number of romantic relationships
hist(ao3$"Num_romance")

# Number of platonic relationships
hist(ao3$"Num_platonic")

All of the histograms above show that all of the numerical variables are right-skewed, meaning that the mean is likely to be greater than the median, and most of the data is towards the lower bound of the range.

Next, we can look at a scatter plot of each of the different numerical variables to see if there is a relationship between number of likes, since this is the target variable, and a specific numerical variable. I use ‘ggplot2’ to plot the different scatter plots.

library(ggplot2)
# Number of comments
ggplot(ao3, aes(x=Num_comments, y=Num_kudos)) +
    geom_point() +
    geom_smooth(method=lm)

# Number of view
ggplot(ao3, aes(x=Num_hits, y=Num_kudos)) +
    geom_point() +
    geom_smooth(method=lm)

# Number of bookmarks
ggplot(ao3, aes(x=Num_bookmarks, y=Num_kudos)) +
    geom_point() +
    geom_smooth(method=lm)

# Number of characters
ggplot(ao3, aes(x=Num_characters, y=Num_kudos)) +
    geom_point() +
    geom_smooth(method=lm)

# Number of romantic relationships
ggplot(ao3, aes(x=Num_romance, y=Num_kudos)) +
    geom_point() +
    geom_smooth(method=lm)

# Number of platonic relationships
ggplot(ao3, aes(x=Num_platonic, y=Num_kudos)) +
    geom_point() +
    geom_smooth(method=lm)

Looking at views and bookmarks, there seems to be a linear relationship between them, while for comments, it does seem linear, but it can be argued to have a more quadratic relationship as well – this can be tested when we make the model. For characters, romantic relationships, and platonic relationships, there does not seem to be a relation between those variables and the number of likes, meaning these may not be good indicators for predicting the number of likes.

Categorical Variables

For categorical variables, one can use a box and whisker plot to see some basic statistics. This can give insight on how different categories in a categorical variable varies in median and other variables. Additionally, one can use bar graphs to show the total sum of likes between different subcategories of a categorical variable to see if one subcategory has more of an effect than another. The categorical variables in this data set includes ‘Has_romance’, ‘Has_platonic’, ‘Rating’, ‘Pairing’, ‘Is_crossover’, ‘Fluff’, ‘Angst’, ‘Hurt_comfort’, ‘AU’, ‘Fluff_angst’, and ‘Five_tags’.

First, we will look to see if there is a difference in the number of likes if platonic or romance relationships are included in the data set.

library(plotly)

# Has_romance
plot_ly(ao3,
    y = ~Num_kudos,
    color = ~Has_romance,
    type = "box",
    boxpoints = 'suspectedoutliers') %>% 
  layout(title = "Likes by romantic relationship",
      xaxis = list(title = "Romance",
                    zeroline = FALSE),
      yaxis = list(title = "Likes",
                    zeroline = FALSE,
                    range = c(0,250)))
# bar graph of the sum of likes
ggplot(ao3, aes(x=Has_romance, y=Num_kudos)) +
    geom_bar(stat = "identity", fill='lightpink')

# Has_platonic
plot_ly(ao3,
    y = ~Num_kudos,
    color = ~Has_platonic,
    type = "box",
    boxpoints = 'suspectedoutliers') %>% 
  layout(title = "Likes by platonic relationship",
      xaxis = list(title = "Platonic",
                    zeroline = FALSE),
      yaxis = list(title = "Likes",
                    zeroline = FALSE,
                    range = c(0,300)))
ggplot(ao3, aes(x=Has_platonic, y=Num_kudos)) +
    geom_bar(stat = "identity", fill='aquamarine3')

For romantic relationships, it seems that there will be a higher median number of likes, but there is not as much of a difference if there is a platonic relationship included when looking at the box plots. With the bar graphs, romantic relationships do have an increase in the number of likes that a fanfiction receives compared to not having one whileit is the opposite for platonic relationships.

Next, one can look at rating. This consists of five subcategories: explicit, mature, teens and up audiences, general audiences, and not rated. With this different ratings, while by themselves, it will be interesting – it would be interesting to see if there is an effect on likes if were look at rating and whether or not there was a romance relationship.

# Rating
plot_ly(ao3,
    y = ~Num_kudos,
    color = ~Rating,
    type = "box",
    boxpoints = 'suspectedoutliers') %>% 
  layout(boxmode = "group",
         title = "Likes by rating",
      xaxis = list(title = "Rating",
                    zeroline = FALSE),
      yaxis = list(title = "Likes",
                    zeroline = FALSE,
                    range = c(0,350)))
ggplot(ao3, aes(x=Rating, y=Num_kudos)) +
    geom_bar(stat = "identity", fill='lightgreen')

# Rating and romance
plot_ly(ao3,
    y = ~Num_kudos,
    x = ~Has_romance,
    color = ~Rating,
    type = "box",
    boxpoints = 'suspectedoutliers') %>% 
  layout(boxmode = "group",
         title = "Likes by rating and romance",
      xaxis = list(title = "Has Romance",
                    zeroline = FALSE),
      yaxis = list(title = "Likes",
                    zeroline = FALSE,
                    range = c(0,350)))

For the original rating box plot, explicit has the highest median number of likes while the other ratings seem to have fairly similar medians. The other plot with romance and rating contains more exciting details. It seems that fanfiction has an overall higher number of likes if it contains a romantic relationship in general with the highest median number of likes in the explicit category with the mature rating close behind while the box plots for the ratings not including a romantic relationship were fairly similar.

While the original rating box plot indicates that explicit stories will receive more likes, the bar graph states something completely different: the ‘Teens and Up Audiences’ subcategory by far as the most number of total likes compared to the other ratings, though explicit is close to the same number of likes.

The pairing column is too closely related to the ‘Has_romance’ column to be compared. If there is a pairing given, it will have a romance relationship because the pairing column indicates the genders of the people in the romantic relationship (if any). Therefore, we will just use a normal plot to look at the column.

# Pairing
plot_ly(ao3,
    y = ~Num_kudos,
    color = ~Pairing,
    type = "box",
    boxpoints = 'suspectedoutliers') %>% 
  layout(boxmode = "group",
         title = "Likes by Pairing",
      xaxis = list(title = "Pairing",
                    zeroline = FALSE),
      yaxis = list(title = "Likes",
                    zeroline = FALSE,
                    range = c(0,300)))
ggplot(ao3, aes(x=Pairing, y=Num_kudos)) +
    geom_bar(stat = "identity", fill='coral1')

For this column, both box plot and bar graph lead to same conclusion: the ‘M/M’ pairing is by far the most popular pairing to receive the most number of likes, so if there is a romantic relationship, that pairing would give the most number of likes.

In TV shows or books, there are occasionally episodes that have characters or situations that tend to be common for another show or book, and this is a crossover. This variable will hopefully give some insight whether crossovers tend to be well received by people.

# Is_crossover
plot_ly(ao3,
    y = ~Num_kudos,
    color = ~Is_crossover,
    type = "box",
    boxpoints = 'suspectedoutliers') %>% 
  layout(boxmode = "group",
         title = "Likes by Crossover",
      xaxis = list(title = "Is_crossover",
                    zeroline = FALSE),
      yaxis = list(title = "Likes",
                    zeroline = FALSE,
                    range = c(0,250)))
ggplot(ao3, aes(x=Is_crossover, y=Num_kudos)) +
    geom_bar(stat = "identity", fill='mediumorchid3')

The box plot does not seem to show much of a difference between the number of likes between a fanfiction if it is a crossover or not. Though, the bar graph definitely shows a skew in the direction of non-crossover fanfictions, but this can be due to the number of crossover versus non-crossover fanfictions in the data frame.

# number of crossover fanfictions
nrow(ao3[ao3$Is_crossover==TRUE,])
## [1] 9468
# number of non-crossover fanfictions
nrow(ao3[ao3$Is_crossover==FALSE,])
## [1] 28663

From the code above, there are definitely more records of a fanfiction that is not a crossover rather than being a crossover.

The last of the categorical variables all deal with the tags included in the fanfiction. Given the frequency of the tags given earlier in the project, many of these columns may have a similar problem to the ‘Is_crossover’ column, where there are more stories without the tag than with them, which is why I created the ‘Five_tags’ column.

First, we can look at all of the box plots for the different tags (excluding the combination tag column).

#Fluff
plot_ly(ao3,
    y = ~Num_kudos,
    color = ~Fluff,
    type = "box",
    boxpoints = 'suspectedoutliers') %>% 
  layout(boxmode = "group",
         title = "Likes by Fluff",
      xaxis = list(title = "Fluff",
                    zeroline = FALSE),
      yaxis = list(title = "Likes",
                    zeroline = FALSE,
                    range = c(0,300)))
ggplot(ao3, aes(x=Fluff, y=Num_kudos)) +
    geom_bar(stat = "identity", fill='lightslateblue')

#Angst
plot_ly(ao3,
    y = ~Num_kudos,
    color = ~Angst,
    type = "box",
    boxpoints = 'suspectedoutliers') %>% 
  layout(boxmode = "group",
         title = "Likes by Angst",
      xaxis = list(title = "Angst",
                    zeroline = FALSE),
      yaxis = list(title = "Likes",
                    zeroline = FALSE,
                    range = c(0,350)))
ggplot(ao3, aes(x=Angst, y=Num_kudos)) +
    geom_bar(stat = "identity", fill='indianred2')

#Hurt_comfort
plot_ly(ao3,
    y = ~Num_kudos,
    color = ~Hurt_comfort,
    type = "box",
    boxpoints = 'suspectedoutliers') %>% 
  layout(boxmode = "group",
         title = "Likes by hurt/comfort",
      xaxis = list(title = "Hurt_comfort",
                    zeroline = FALSE),
      yaxis = list(title = "Likes",
                    zeroline = FALSE,
                    range = c(0,250)))
ggplot(ao3, aes(x=Hurt_comfort, y=Num_kudos)) +
    geom_bar(stat = "identity", fill='rosybrown1')

# Fluff_angst
plot_ly(ao3,
    y = ~Num_kudos,
    color = ~Fluff_angst,
    type = "box",
    boxpoints = 'suspectedoutliers') %>% 
  layout(boxmode = "group",
         title = "Likes by fluff and angst",
      xaxis = list(title = "Fluff_angst",
                    zeroline = FALSE),
      yaxis = list(title = "Likes",
                    zeroline = FALSE,
                    range = c(0,350)))
ggplot(ao3, aes(x=Fluff_angst, y=Num_kudos)) +
    geom_bar(stat = "identity", fill='navajowhite1')

# Alternate_universe
plot_ly(ao3,
    y = ~Num_kudos,
    color = ~AU,
    type = "box",
    boxpoints = 'suspectedoutliers') %>% 
  layout(title = "Likes by AU",
      xaxis = list(title = "Alternate universe",
                    zeroline = FALSE),
      yaxis = list(title = "Likes",
                    zeroline = FALSE,
                    range = c(0,450)))
ggplot(ao3, aes(x=AU, y=Num_kudos)) +
    geom_bar(stat = "identity", fill='cadetblue2')

As one can see from the graphs above, while four of the tags (fluff, angst, hurt/comfort, and fluff and angst) all have median likes higher than if the tag was not listed in the fanfiction, but, overall, in the bar graphs, it is almost impossible to compare the sum number of likes if the tag is include. This is why the top five tags are combined together to see if any combination of those five tags can give a better number of views.

# Five_tags
plot_ly(ao3,
    y = ~Num_kudos,
    color = ~Five_tags,
    type = "box",
    boxpoints = 'suspectedoutliers') %>% 
  layout(title = "Likes by Top Tags",
      xaxis = list(title = "Top Tags",
                    zeroline = FALSE),
      yaxis = list(title = "Likes",
                    zeroline = FALSE,
                    range = c(0,300)))
ggplot(ao3, aes(x=Five_tags, y=Num_kudos)) +
    geom_bar(stat = "identity", fill='lightsteelblue2')

In the box plot, similar to the previous plots, including one of the top five tags in one’s fanfiction will sightly increase the median number of likes. Though the data set stats that there are about 19,000 FALSE and 18,000 TRUE entries for the column, the sum number of likes for the fanfiction that include at least one of the top five tags receives more likes compared to the fanfiction that do not inlcude the tags.

After looking at all of the different visualizations for categorical variables, we can use the dplyr library in R to display some numbers for the subcategories of the variables.

library(dplyr)
# see stats with Rating
ao3 %>%
  group_by(Rating) %>%
  summarize(mean.kudos = mean(Num_kudos), min.kudos = min(Num_kudos), max.kudos = max(Num_kudos))

For rating, the explicit rating has the highest mean of likes (kudos), but the rating with the highest number of overall likes is a fanfiction that is not rated followed closely by a mature rating fanfiction.

ao3 %>%
  group_by(Pairing) %>%
  summarize(mean.kudos = mean(Num_kudos), min.kudos = min(Num_kudos), max.kudos = max(Num_kudos))

Surprisingly, the numerical results for the pairing column show that having multiple types of pairings will give the highest average number of likes while the fanfiction with the highest number of likes overall is part of the ‘Gen’ pairing, meaning there are no romantic pairings, which is closely followed by the ‘Multi’ subcategory.

# see stats with the different tags
ao3 %>%
  group_by(Fluff,Angst,AU,Hurt_comfort,Fluff_angst,Five_tags) %>%
  summarize(mean.kudos = mean(Num_kudos), min.kudos = min(Num_kudos), max.kudos = max(Num_kudos))

This way of grouping aids us in understanding exactly which combinations of the top five tags give the highest number of likes. If a fanfiction includes both angst and hurt/comfort, then it will have the highest mean number of likes, followed closely by fanfictions including angst, alternate universe - canon divergence, and hurt/comfort, while the fanfiction with the highest overall number of likes only includes angst.

Model Selection

Of the different models that were learned about in class, the research question can be most easily answered with a multiple linear regression. There are several strengths of using this model, such as its popularity and adaptability. Additionally, it can provide estimates of the strength and size of the relationships, features, and outcome. Though it is a very common approach for predicting variables, there are also several weaknesses in the model. Linear regression does not handle missing data well, the equation usually needs to be formed by the person, and it makes strong assumptions about the data. Luckily, missing data was handled earlier in the code, and there are methods a person can use to determine the best variables to form a model. Some of the assumptions made when choosing a regression model includes little to no multicollinearity, model can be expressed linearly, residuals are normally distributed, and variance of the errors are homoskedastic.

Choosing the Numerical Variables

To choose the best numerical variables, one should look at the correlation between the target variable and the other variables. We can use the ‘psych’ library in R to see the different correlations, and it will give insight to see if there are other variables that can cause multicollinearity if put into the model. I use the stars method to show the significance of the correlation in the table.

library(psych)
pairs.panels(ao3[c("Num_kudos", "Num_bookmarks", "Num_hits", "Num_comments", "Num_romance", "Num_platonic", "Num_characters")], stars = TRUE)

While ‘Num_bookmarks’ has a high correlation with likes (‘Num_kudos’), since there is no real company or media equivalent of the variable, because bookmarks are the number of people who decide to save a fanfiction, it will have to be ignored for the model. ‘Num_hits’ and ‘Num_comments’, on the other hand, also have a high correlation and make more sense has variables for a company or author can use to gauge the number of likes. The only worry with using both of the variables is that they are also highly correlated with each other, so this may cause multicollinearity in the model. The last three variables involving the relationships and characters have almost no correlation with the target variable, making them very unlikely predictors for the model. Of the numerical variables, ‘Num_hits’ and ‘Num_comments’ seem to be the most likely contenders.

Multicollinearity

To test the multicollinearity between the variables, we can calculate the VIF. A standard rule for VIF is if it is under five, then there is not any multicollinearity.

library(car)
# make model to test multicollinearity
m <- lm(Num_kudos ~ Num_hits + Num_comments, data = ao3)
summary(m)
## 
## Call:
## lm(formula = Num_kudos ~ Num_hits + Num_comments, data = ao3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18261.3    -43.7    -32.6      1.6   5909.9 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.676e+01  1.044e+00   44.78   <2e-16 ***
## Num_hits     2.206e-02  1.044e-04  211.25   <2e-16 ***
## Num_comments 3.169e-01  7.978e-03   39.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 200.5 on 38128 degrees of freedom
## Multiple R-squared:  0.7706, Adjusted R-squared:  0.7705 
## F-statistic: 6.402e+04 on 2 and 38128 DF,  p-value: < 2.2e-16
vif(m)
##     Num_hits Num_comments 
##     2.186333     2.186333

Since both of the values are below five, we do not need to worry about multicollinearity.

Significance of Squared Terms

Next, we need to test if nonlinear terms are needed for the model. To test for squared terms in the model, we can use the Ramsey Reset test, where the null hypothesis is that the linear assumption is correct.

library(lmtest)
# test number of views
resettest(Num_kudos ~ Num_hits, power = 2, type = c("fitted", "regressor",
  "princomp"), data = ao3)
## 
##  RESET test
## 
## data:  Num_kudos ~ Num_hits
## RESET = 1764.3, df1 = 1, df2 = 38128, p-value < 2.2e-16
# test number of comments
resettest(Num_kudos ~ Num_comments, power = 2, type = c("fitted", "regressor",
  "princomp"), data = ao3)
## 
##  RESET test
## 
## data:  Num_kudos ~ Num_comments
## RESET = 2197.2, df1 = 1, df2 = 38128, p-value < 2.2e-16

According to the test, both numerical variables can benefit the model by including their non-linear specifications. To best see if they do help the model, one should create univariate regressions including the new squared term.

# create squared term for both variables
ao3$hits2 <- ao3$Num_hits * ao3$Num_hits
ao3$comments2 <- ao3$Num_comments * ao3$Num_comments
# views
s1 <- lm(Num_kudos ~ Num_hits + hits2, data = ao3)
summary(s1)
## 
## Call:
## lm(formula = Num_kudos ~ Num_hits + hits2, data = ao3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1034.4   -18.9   -12.5     7.5  3429.9 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.776e+01  5.981e-01   29.70   <2e-16 ***
## Num_hits     4.851e-02  2.794e-04  173.58   <2e-16 ***
## hits2       -4.683e-07  9.432e-09  -49.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 99.55 on 37836 degrees of freedom
##   (292 observations deleted due to missingness)
## Multiple R-squared:  0.7058, Adjusted R-squared:  0.7058 
## F-statistic: 4.539e+04 on 2 and 37836 DF,  p-value: < 2.2e-16
#comments
s2 <- lm(Num_kudos ~ Num_comments + comments2, data = ao3)
summary(s2)
## 
## Call:
## lm(formula = Num_kudos ~ Num_comments + comments2, data = ao3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8194.4   -46.7   -34.7     3.4 23294.1 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.974e+01  1.514e+00   32.85   <2e-16 ***
## Num_comments  2.041e+00  1.288e-02  158.49   <2e-16 ***
## comments2    -9.669e-05  2.063e-06  -46.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 287.3 on 38128 degrees of freedom
## Multiple R-squared:  0.5291, Adjusted R-squared:  0.5291 
## F-statistic: 2.142e+04 on 2 and 38128 DF,  p-value: < 2.2e-16

In both of the univariate models, all of the variables were significant, illustrating that they make be important variables to include in the overall model.

Significance of Categorical Variables

To see if the categorical variables are significant to predicting the number of likes, we will use a one-way anova. This will tell us if the depedent variable (Num_kudos) changes depending on the level/subcategory of the independent variable.

# Rating
rating.aov <- aov(Num_kudos ~ Rating, data = ao3)
summary(rating.aov)
##                Df    Sum Sq Mean Sq F value Pr(>F)    
## Rating          4 3.442e+07 8604326   49.34 <2e-16 ***
## Residuals   38126 6.649e+09  174395                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Pairing
pairing.aov <- aov(Num_kudos ~ Pairing, data = ao3)
summary(pairing.aov)
##                Df    Sum Sq Mean Sq F value   Pr(>F)    
## Pairing         6 1.270e+07 2116132   12.09 1.25e-13 ***
## Residuals   38124 6.671e+09  174974                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Five_tags
tags.aov <- aov(Num_kudos ~ Five_tags, data = ao3)
summary(tags.aov)
##                Df    Sum Sq  Mean Sq F value Pr(>F)    
## Five_tags       1 3.014e+07 30139636   172.7 <2e-16 ***
## Residuals   38129 6.653e+09   174493                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Is_crossover
cross.aov <- aov(Num_kudos ~ Is_crossover, data = ao3)
summary(cross.aov)
##                 Df    Sum Sq Mean Sq F value   Pr(>F)    
## Is_crossover     1 5.108e+06 5108205   29.16 6.69e-08 ***
## Residuals    38129 6.678e+09  175150                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Has_romance
romance.aov <- aov(Num_kudos ~ Has_romance, data = ao3)
summary(romance.aov)
##                Df    Sum Sq Mean Sq F value   Pr(>F)    
## Has_romance     1 8.256e+06 8255500   47.16 6.66e-12 ***
## Residuals   38129 6.675e+09  175067                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Has_platonic
platonic.aov <- aov(Num_kudos ~ Has_platonic, data = ao3)
summary(platonic.aov)
##                 Df    Sum Sq  Mean Sq F value Pr(>F)    
## Has_platonic     1 1.537e+07 15365391   87.86 <2e-16 ***
## Residuals    38129 6.668e+09   174881                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to the results of all of the different one-way ANOVA tests, all of the categorical variables seem to be significant, meaning we should look at the univariate regression results for each variable.

# Rating
rating.lm <- lm(Num_kudos ~ Rating, data = ao3)
summary(rating.lm)
## 
## Call:
## lm(formula = Num_kudos ~ Rating, data = ao3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -148.5  -102.9   -64.9   -17.9 30796.1 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  149.532      4.705  31.780  < 2e-16 ***
## RatingGeneral Audiences      -83.597      6.567 -12.730  < 2e-16 ***
## RatingMature                 -10.123      6.989  -1.448    0.147    
## RatingNot Rated              -50.671      7.951  -6.373 1.88e-10 ***
## RatingTeen And Up Audiences  -39.657      6.145  -6.454 1.10e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 417.6 on 38126 degrees of freedom
## Multiple R-squared:  0.00515,    Adjusted R-squared:  0.005045 
## F-statistic: 49.34 on 4 and 38126 DF,  p-value: < 2.2e-16
# Pairing
pairing.lm <- lm(Num_kudos ~ Pairing, data = ao3)
summary(pairing.lm)
## 
## Call:
## lm(formula = Num_kudos ~ Pairing, data = ao3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -134.6  -101.7   -74.7   -21.7 30780.2 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         102.759      8.396  12.239  < 2e-16 ***
## PairingF/M           -3.026      9.830  -0.308 0.758182    
## PairingGen           11.996     10.106   1.187 0.235248    
## PairingM/M           16.912      9.199   1.838 0.066003 .  
## PairingMulti         32.845      9.749   3.369 0.000755 ***
## PairingNo category  -29.361     11.394  -2.577 0.009975 ** 
## PairingOther        -40.267     15.922  -2.529 0.011445 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 418.3 on 38124 degrees of freedom
## Multiple R-squared:  0.0019, Adjusted R-squared:  0.001743 
## F-statistic: 12.09 on 6 and 38124 DF,  p-value: 1.255e-13
# Five_tags
tags.lm <- lm(Num_kudos ~ Five_tags, data = ao3)
summary(tags.lm)
## 
## Call:
## lm(formula = Num_kudos ~ Five_tags, data = ao3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -140.6   -98.6   -72.3   -20.3 30753.4 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     85.314      2.966   28.76   <2e-16 ***
## Five_tagsTRUE   56.274      4.282   13.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 417.7 on 38129 degrees of freedom
## Multiple R-squared:  0.00451,    Adjusted R-squared:  0.004484 
## F-statistic: 172.7 on 1 and 38129 DF,  p-value: < 2.2e-16
# Is_crossover
cross.lm <- lm(Num_kudos ~ Is_crossover, data = ao3)
summary(cross.lm)
## 
## Call:
## lm(formula = Num_kudos ~ Is_crossover, data = ao3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -131.5   -99.7   -80.7   -21.7 30762.5 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       105.671      2.472   42.75  < 2e-16 ***
## Is_crossoverTRUE   26.791      4.961    5.40 6.69e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 418.5 on 38129 degrees of freedom
## Multiple R-squared:  0.0007643,  Adjusted R-squared:  0.0007381 
## F-statistic: 29.16 on 1 and 38129 DF,  p-value: 6.687e-08
# Has_romance
romance.lm <- lm(Num_kudos ~ Has_romance, data = ao3)
summary(romance.lm)
## 
## Call:
## lm(formula = Num_kudos ~ Has_romance, data = ao3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -119.4  -103.4   -78.4   -23.4 30809.6 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       85.378      4.471  19.097  < 2e-16 ***
## Has_romanceTRUE   34.980      5.094   6.867 6.66e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 418.4 on 38129 degrees of freedom
## Multiple R-squared:  0.001235,   Adjusted R-squared:  0.001209 
## F-statistic: 47.16 on 1 and 38129 DF,  p-value: 6.655e-12
# Has_platonic
platonic.lm <- lm(Num_kudos ~ Has_platonic, data = ao3)
summary(platonic.lm)
## 
## Call:
## lm(formula = Num_kudos ~ Has_platonic, data = ao3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -141.7   -96.1   -78.1   -20.7 30752.3 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        99.071      2.566  38.607   <2e-16 ***
## Has_platonicTRUE   43.660      4.658   9.373   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 418.2 on 38129 degrees of freedom
## Multiple R-squared:  0.002299,   Adjusted R-squared:  0.002273 
## F-statistic: 87.86 on 1 and 38129 DF,  p-value: < 2.2e-16

In the rating model, explicit is being used as a reference variable (which is the subcategory not shown in the model), meaning that the coefficients of the model show how much more or less likes a fanfiction can receive if it chose a different rating compared to a explicit rating. Of the different subcategories, mature was the only one that was not significant according to the p-value since the values of mature and explicit ratings were similar this does make some sense.

For the pairing model, the ‘F/F’ relationship (meaning a relationship between a female and female) was the reference group. If we are to keep an alpha or significance of 0.05, three of the subcategories, ‘F/M’, ‘Gen’, and ‘M/M’, will not be significant, making me more hesitant to use this variable by itself, but this variable may be useful for interaction terms, so it will not be completely discarded yet.

The last four binary variables all had significant coefficient, where the FALSE counterpart was the reference for all of them. The coefficients for all of the models were positive, meaning that if this item is included, then a fanfiction will likely receive more likes.

Interaction Terms

To test the interaction between the two numerical variables, we can run a small regression model to see if the interaction term is significant in it.

hits.comments <- lm(Num_kudos ~ Num_hits + Num_comments + Num_hits*Num_comments, data = ao3)
summary(hits.comments)
## 
## Call:
## lm(formula = Num_kudos ~ Num_hits + Num_comments + Num_hits * 
##     Num_comments, data = ao3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18208.2    -44.6    -33.4      1.2   5774.6 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.782e+01  1.061e+00  45.073  < 2e-16 ***
## Num_hits              2.167e-02  1.257e-04 172.350  < 2e-16 ***
## Num_comments          3.058e-01  8.222e-03  37.188  < 2e-16 ***
## Num_hits:Num_comments 1.340e-07  2.411e-08   5.556 2.77e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 200.5 on 38127 degrees of freedom
## Multiple R-squared:  0.7707, Adjusted R-squared:  0.7707 
## F-statistic: 4.273e+04 on 3 and 38127 DF,  p-value: < 2.2e-16

The model above shows that the interaction term between the two numerical variables is significant and can help the model.

Some categorical variable interaction that may be interesting to look into is ‘Pairing’ and ‘Has_romance’, ‘Rating’ and ‘Has_romance’, ‘Five_tags’ and ‘Has_romance’, and ‘Rating’ and ‘Five_tags’. We, again, can use ANOVA to determine whether the interaction terms are significant or not.

# check interaction between pairing and romance
pair.romance <- aov(Num_kudos ~ Pairing * Has_romance, data = ao3)
summary(pair.romance)
##                        Df    Sum Sq Mean Sq F value   Pr(>F)    
## Pairing                 6 1.270e+07 2116132  12.108 1.21e-13 ***
## Has_romance             1 8.235e+06 8234888  47.118 6.79e-12 ***
## Pairing:Has_romance     6 7.151e+05  119179   0.682    0.664    
## Residuals           38117 6.662e+09  174771                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# rating and romance interaction
rate.romance <- aov(Num_kudos ~ Rating * Has_romance, data = ao3)
summary(rate.romance)
##                       Df    Sum Sq Mean Sq F value   Pr(>F)    
## Rating                 4 3.442e+07 8604326  49.393  < 2e-16 ***
## Has_romance            1 2.282e+06 2282115  13.100 0.000296 ***
## Rating:Has_romance     4 5.931e+06 1482720   8.511 7.34e-07 ***
## Residuals          38121 6.641e+09  174202                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# five tags and romance interaction
tags.romance <- aov(Num_kudos ~ Five_tags * Has_romance, data = ao3)
summary(tags.romance)
##                          Df    Sum Sq  Mean Sq F value   Pr(>F)    
## Five_tags                 1 3.014e+07 30139636 172.895  < 2e-16 ***
## Has_romance               1 5.433e+06  5433443  31.169 2.38e-08 ***
## Five_tags:Has_romance     1 1.387e+06  1386907   7.956   0.0048 ** 
## Residuals             38127 6.646e+09   174324                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# rating and five tags
rate.tags <- aov(Num_kudos ~ Rating * Five_tags, data = ao3)
summary(rate.tags)
##                     Df    Sum Sq  Mean Sq F value   Pr(>F)    
## Rating               4 3.442e+07  8604326   49.64  < 2e-16 ***
## Five_tags            1 3.541e+07 35411890  204.28  < 2e-16 ***
## Rating:Five_tags     4 5.457e+06  1364221    7.87 2.46e-06 ***
## Residuals        38121 6.608e+09   173346                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There does not seem to be an interaction between pairing and whether there is a romantic relationship or not, but there does seem to be an interaction between rating and whether the story contains a romantic relationship or not, which, at a glance, does make sense because with romantic relationships included in media, sometimes is does tend to contain more adult content. Additionally, it seems that there is an interaction between the top five tags and whether or not a romantic relationship is included. Rating and tags also interact with one another, which makes sense because depending on the tag it would affect the overall rating of the fanfiction.

# rating and romance interaction
rate.romance.lm <- lm(Num_kudos ~ Rating * Has_romance + Rating + Has_romance, data = ao3)
summary(rate.romance.lm)
## 
## Call:
## lm(formula = Num_kudos ~ Rating * Has_romance + Rating + Has_romance, 
##     data = ao3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -155.1  -101.1   -65.1   -18.6 30804.0 
## 
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                                    47.13      19.13   2.463
## RatingGeneral Audiences                        12.98      20.63   0.629
## RatingMature                                   44.16      23.36   1.891
## RatingNot Rated                                43.83      22.57   1.942
## RatingTeen And Up Audiences                    63.43      20.52   3.091
## Has_romanceTRUE                               108.99      19.74   5.523
## RatingGeneral Audiences:Has_romanceTRUE       -99.99      21.94  -4.557
## RatingMature:Has_romanceTRUE                  -52.48      24.50  -2.142
## RatingNot Rated:Has_romanceTRUE               -97.92      24.30  -4.030
## RatingTeen And Up Audiences:Has_romanceTRUE  -109.94      21.59  -5.091
##                                             Pr(>|t|)    
## (Intercept)                                  0.01377 *  
## RatingGeneral Audiences                      0.52921    
## RatingMature                                 0.05867 .  
## RatingNot Rated                              0.05212 .  
## RatingTeen And Up Audiences                  0.00199 ** 
## Has_romanceTRUE                             3.36e-08 ***
## RatingGeneral Audiences:Has_romanceTRUE     5.20e-06 ***
## RatingMature:Has_romanceTRUE                 0.03223 *  
## RatingNot Rated:Has_romanceTRUE             5.59e-05 ***
## RatingTeen And Up Audiences:Has_romanceTRUE 3.58e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 417.4 on 38121 degrees of freedom
## Multiple R-squared:  0.006379,   Adjusted R-squared:  0.006144 
## F-statistic: 27.19 on 9 and 38121 DF,  p-value: < 2.2e-16
# five tags and romance interaction
tags.romance.lm <- lm(Num_kudos ~ Five_tags * Has_romance + Five_tags + Has_romance, data = ao3)
summary(tags.romance.lm)
## 
## Call:
## lm(formula = Num_kudos ~ Five_tags * Has_romance + Five_tags + 
##     Has_romance, data = ao3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -142.7   -97.7   -68.3   -20.3 30762.7 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     55.669      5.701   9.765  < 2e-16 ***
## Five_tagsTRUE                   76.650      9.157   8.371  < 2e-16 ***
## Has_romanceTRUE                 40.638      6.675   6.089 1.15e-09 ***
## Five_tagsTRUE:Has_romanceTRUE  -29.257     10.373  -2.821   0.0048 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 417.5 on 38127 degrees of freedom
## Multiple R-squared:  0.00553,    Adjusted R-squared:  0.005452 
## F-statistic: 70.67 on 3 and 38127 DF,  p-value: < 2.2e-16
# rating and five tags
rate.tags.lm <- lm(Num_kudos ~ Rating * Five_tags + Rating + Five_tags, data = ao3)
summary(rate.tags.lm)
## 
## Call:
## lm(formula = Num_kudos ~ Rating * Five_tags + Rating + Five_tags, 
##     data = ao3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -204.2   -97.2   -63.3   -16.9 30765.8 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                118.408      5.858  20.212  < 2e-16
## RatingGeneral Audiences                    -66.500      8.733  -7.615 2.70e-14
## RatingMature                               -26.838      9.400  -2.855  0.00430
## RatingNot Rated                            -46.142     10.533  -4.381 1.19e-05
## RatingTeen And Up Audiences                -36.889      8.235  -4.480 7.49e-06
## Five_tagsTRUE                               86.754      9.781   8.870  < 2e-16
## RatingGeneral Audiences:Five_tagsTRUE      -58.850     13.382  -4.398 1.10e-05
## RatingMature:Five_tagsTRUE                   7.269     14.208   0.512  0.60890
## RatingNot Rated:Five_tagsTRUE              -29.821     16.115  -1.850  0.06425
## RatingTeen And Up Audiences:Five_tagsTRUE  -33.898     12.573  -2.696  0.00702
##                                              
## (Intercept)                               ***
## RatingGeneral Audiences                   ***
## RatingMature                              ** 
## RatingNot Rated                           ***
## RatingTeen And Up Audiences               ***
## Five_tagsTRUE                             ***
## RatingGeneral Audiences:Five_tagsTRUE     ***
## RatingMature:Five_tagsTRUE                   
## RatingNot Rated:Five_tagsTRUE             .  
## RatingTeen And Up Audiences:Five_tagsTRUE ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 416.3 on 38121 degrees of freedom
## Multiple R-squared:  0.01126,    Adjusted R-squared:  0.01103 
## F-statistic: 48.26 on 9 and 38121 DF,  p-value: < 2.2e-16

Significant Terms for the Model

After doing all of these different tests, we can now list all of the potential terms that will be best for the model to fit all of the data. The numerical variables will be ‘Num_hits’ and ‘Num_comments’ because they were significant in the univariate regression results and high correlation. Although the two variables did seem to have a high correlation between themselves, when I checked the VIF of both variables, it was under 5, meaning that there was not any mutlicollinearity present. ‘Num_bookmarks’ could not be included, despite its high correlation, because for the purpose of the research question, the model needs to be easily able to be translated for companies and/or authors to use for there purposes, and there is not an easy way to determine whether a viewer/reader liked a story enough to save it for themselves.

Of the several categorical variables, ‘Rating’, ‘Has_romance’, ‘Has_platonic’, ‘Five_tags’, and ‘Is_crossover’ seemed to all be good predictors for the number of likes. This was determined by doing an ANOVA to see if the variables were significant and checking the results of each variable with a univariate regression. I decided not to include the ‘Pairing’ column because while it was considered significant with the ANOVA, the results of the univariate regression displayed that some of the subcategories of the column were not significant. While it is usually all right for some subcategories to not be significant, the most popular subcategory ‘M/M’ was not considered to be significant, which led me to not use the variable.

To test for nonlinear terms for the numerical variables, I used a Ramsey Reset test to test if the numerical variables would be significant if the terms were squared. For both ‘Num_hits’ and ‘Num_comments’, they had a low p-value, meaning that squared values may be significant to the model. Then, I ran regressions including the squared term with the original term, and it showed that the terms were significant that way as well.

Lastly, I tested to see if there was any interaction between variables. For the numerical variables, I ran a simple regression to see if there was interaction between ‘Num_hits’ and ‘Num_comments’. The results of the model showed that the interaction term was significant. Additionally, there seemed to be an interaction between ‘Rating’ and ‘Has_romance’, ‘Five_tags’ and ‘Has_romance’, and ‘Rating’ and ‘Five_tags’. This was determined by doing an ANOVA and checking with a regression model to determine if it is significant. Unfortunately, ‘Pairing’ and ‘Has_romance’ was not considered significant when I ran the ANOVA.

Model Analysis

Spliting the Data Set

We need to split the data set to best determine how well a model performed, so we need to create a ‘train’ set for the model to learn with, and we will have a ‘test’ set of values that the model has never seen before and have it predict the target in the ‘test’ set. We also use this to calculate MAPE (mean absolute percent error) or RMSE (root mean squared error) to see how well one model performed compared to another model.

In the code below, I use a 80/20 split, meaning that 80% of the data is being used as training data while the other 20% is being used for testing. Then, I use the summary command to see if the values for the numerical values are similar and the proportion of the categorical variables are similar to make sure the data was split properly.

# set seed to have same split
set.seed(42)
# split data into train and test sets (80/20 split default)
index <- sample(1:nrow(ao3),nrow(ao3)*0.8,replace=FALSE)
train<-ao3[index,]
test<-ao3[-index,]
summary(train)
##     Title              Author                ID             Fandoms         
##  Length:30504       Length:30504       Min.   :  426931   Length:30504      
##  Class :character   Class :character   1st Qu.:44887980   Class :character  
##  Mode  :character   Mode  :character   Median :45736150   Mode  :character  
##                                        Mean   :43928837                     
##                                        3rd Qu.:45804638                     
##                                        Max.   :46390825                     
##                                                                             
##   Date_updated                          Rating            Pairing    
##  Min.   :2023-03-12   Explicit             :6263   F/F        :1971  
##  1st Qu.:2023-03-14   General Audiences    :6680   F/M        :5326  
##  Median :2023-03-16   Mature               :5207   Gen        :4432  
##  Mean   :2023-03-16   Not Rated            :3422   M/M        :9926  
##  3rd Qu.:2023-03-18   Teen And Up Audiences:8932   Multi      :5664  
##  Max.   :2023-03-20                                No category:2413  
##                                                    Other      : 772  
##                                                           Warning     
##  No Archive Warnings Apply                                    :14161  
##  Choose Not To Use Archive Warnings                           : 9062  
##  Graphic Depictions Of Violence                               : 2298  
##  Graphic Depictions Of Violence, Major Character Death        :  950  
##  Major Character Death                                        :  906  
##  Choose Not To Use Archive Warnings, No Archive Warnings Apply:  626  
##  (Other)                                                      : 2501  
##  Relationships       Characters            Tags          
##  Length:30504       Length:30504       Length:30504      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##              Complete       Language           Word_count     
##  Complete Work   :19243   Length:30504       Min.   :      0  
##  Work in Progress:11261   Class :character   1st Qu.:   1266  
##                           Mode  :character   Median :   3110  
##                                              Mean   :  14431  
##                                              3rd Qu.:   9815  
##                                              Max.   :3417557  
##                                                               
##   Num_chapters      Num_comments        Num_kudos       Num_bookmarks    
##  Min.   :  0.000   Min.   :    0.00   Min.   :    1.0   Min.   :   0.00  
##  1st Qu.:  1.000   1st Qu.:    1.00   1st Qu.:    9.0   1st Qu.:   1.00  
##  Median :  1.000   Median :    4.00   Median :   30.0   Median :   3.00  
##  Mean   :  4.685   Mean   :   33.32   Mean   :  113.7   Mean   :  18.96  
##  3rd Qu.:  4.000   3rd Qu.:   13.00   3rd Qu.:   91.0   3rd Qu.:  11.00  
##  Max.   :448.000   Max.   :11426.00   Max.   :30895.0   Max.   :8995.00  
##                                                                          
##     Num_hits       Is_crossover    Has_romance     Has_platonic   
##  Min.   :      1   Mode :logical   Mode :logical   Mode :logical  
##  1st Qu.:    136   FALSE:22840     FALSE:7055      FALSE:21293    
##  Median :    420   TRUE :7664      TRUE :23449     TRUE :9211     
##  Mean   :   2562                                                  
##  3rd Qu.:   1342                                                  
##  Max.   :1039591                                                  
##                                                                   
##   Num_romance      Num_platonic   Num_characters     Fluff        
##  Min.   : 0.000   Min.   : 0.00   Min.   : 0.000   Mode :logical  
##  1st Qu.: 1.000   1st Qu.: 0.00   1st Qu.: 2.000   FALSE:22582    
##  Median : 1.000   Median : 0.00   Median : 4.000   TRUE :7922     
##  Mean   : 1.442   Mean   : 0.85   Mean   : 5.379                  
##  3rd Qu.: 2.000   3rd Qu.: 1.00   3rd Qu.: 7.000                  
##  Max.   :47.000   Max.   :57.00   Max.   :72.000                  
##                                                                   
##    Angst         Hurt_comfort        AU          Fluff_angst    
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:22511     FALSE:25834     FALSE:28347     FALSE:28691    
##  TRUE :7993      TRUE :4670      TRUE :2157      TRUE :1813     
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##  Five_tags           hits2             comments2        
##  Mode :logical   Min.   :1.000e+00   Min.   :        0  
##  FALSE:15870     1st Qu.:1.796e+04   1st Qu.:        1  
##  TRUE :14634     Median :1.722e+05   Median :       16  
##                  Mean   :2.038e+07   Mean   :    42720  
##                  3rd Qu.:1.690e+06   3rd Qu.:      169  
##                  Max.   :2.132e+09   Max.   :130553476  
##                  NA's   :236
summary(test)
##     Title              Author                ID             Fandoms         
##  Length:7627        Length:7627        Min.   :  290120   Length:7627       
##  Class :character   Class :character   1st Qu.:44909516   Class :character  
##  Mode  :character   Mode  :character   Median :45738196   Mode  :character  
##                                        Mean   :43883905                     
##                                        3rd Qu.:45805094                     
##                                        Max.   :46313992                     
##                                                                             
##   Date_updated                          Rating            Pairing    
##  Min.   :2023-03-12   Explicit             :1614   F/F        : 511  
##  1st Qu.:2023-03-14   General Audiences    :1631   F/M        :1371  
##  Median :2023-03-16   Mature               :1324   Gen        :1099  
##  Mean   :2023-03-16   Not Rated            : 823   M/M        :2460  
##  3rd Qu.:2023-03-18   Teen And Up Audiences:2235   Multi      :1466  
##  Max.   :2023-03-20                                No category: 536  
##                                                    Other      : 184  
##                                                           Warning    
##  No Archive Warnings Apply                                    :3544  
##  Choose Not To Use Archive Warnings                           :2199  
##  Graphic Depictions Of Violence                               : 546  
##  Graphic Depictions Of Violence, Major Character Death        : 244  
##  Major Character Death                                        : 214  
##  Choose Not To Use Archive Warnings, No Archive Warnings Apply: 188  
##  (Other)                                                      : 692  
##  Relationships       Characters            Tags          
##  Length:7627        Length:7627        Length:7627       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##              Complete      Language           Word_count     
##  Complete Work   :4802   Length:7627        Min.   :      0  
##  Work in Progress:2825   Class :character   1st Qu.:   1284  
##                          Mode  :character   Median :   3097  
##                                             Mean   :  14357  
##                                             3rd Qu.:   9740  
##                                             Max.   :1047347  
##                                                              
##   Num_chapters       Num_comments       Num_kudos      Num_bookmarks    
##  Min.   :   0.000   Min.   :   0.00   Min.   :   1.0   Min.   :   0.00  
##  1st Qu.:   1.000   1st Qu.:   1.00   1st Qu.:   9.0   1st Qu.:   0.00  
##  Median :   1.000   Median :   4.00   Median :  29.0   Median :   3.00  
##  Mean   :   4.785   Mean   :  28.85   Mean   : 106.8   Mean   :  17.27  
##  3rd Qu.:   4.000   3rd Qu.:  12.00   3rd Qu.:  88.0   3rd Qu.:  10.00  
##  Max.   :1715.000   Max.   :4426.00   Max.   :6183.0   Max.   :1413.00  
##                                                                         
##     Num_hits      Is_crossover    Has_romance     Has_platonic   
##  Min.   :     1   Mode :logical   Mode :logical   Mode :logical  
##  1st Qu.:   136   FALSE:5823      FALSE:1704      FALSE:5264     
##  Median :   404   TRUE :1804      TRUE :5923      TRUE :2363     
##  Mean   :  2285                                                  
##  3rd Qu.:  1304                                                  
##  Max.   :207582                                                  
##                                                                  
##   Num_romance      Num_platonic     Num_characters     Fluff        
##  Min.   : 0.000   Min.   : 0.0000   Min.   : 0.000   Mode :logical  
##  1st Qu.: 1.000   1st Qu.: 0.0000   1st Qu.: 2.000   FALSE:5649     
##  Median : 1.000   Median : 0.0000   Median : 4.000   TRUE :1978     
##  Mean   : 1.458   Mean   : 0.8355   Mean   : 5.316                  
##  3rd Qu.: 2.000   3rd Qu.: 1.0000   3rd Qu.: 7.000                  
##  Max.   :43.000   Max.   :31.0000   Max.   :55.000                  
##                                                                     
##    Angst         Hurt_comfort        AU          Fluff_angst    
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:5597      FALSE:6444      FALSE:7071      FALSE:7180     
##  TRUE :2030      TRUE :1183      TRUE :556       TRUE :447      
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##  Five_tags           hits2             comments2       
##  Mode :logical   Min.   :1.000e+00   Min.   :       0  
##  FALSE:3960      1st Qu.:1.822e+04   1st Qu.:       1  
##  TRUE :3667      Median :1.576e+05   Median :      16  
##                  Mean   :2.207e+07   Mean   :   15543  
##                  3rd Qu.:1.579e+06   3rd Qu.:     144  
##                  Max.   :2.135e+09   Max.   :19589476  
##                  NA's   :56

Although the values in the summary table seem to be similar, we can actually test to see if the sample averages for the numerical variables and the sample proportions for the categorical variables are equal. For numerical variables, one needs to do a Welch’s T-test, where the null hypothesis is that the sample averages are equal.

# test if means are different for the numerical variables
# views
t.test(train$Num_hits, test$Num_hits)
## 
##  Welch Two Sample t-test
## 
## data:  train$Num_hits and test$Num_hits
## t = 2.139, df = 23483, p-value = 0.03245
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   23.17895 531.11545
## sample estimates:
## mean of x mean of y 
##  2561.947  2284.800
# comments
t.test(train$Num_comments, test$Num_comments)
## 
##  Welch Two Sample t-test
## 
## data:  train$Num_comments and test$Num_comments
## t = 2.4592, df = 19756, p-value = 0.01393
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.9056732 8.0195440
## sample estimates:
## mean of x mean of y 
##  33.31589  28.85328
# likes
t.test(train$Num_kudos, test$Num_kudos)
## 
##  Welch Two Sample t-test
## 
## data:  train$Num_kudos and test$Num_kudos
## t = 1.6667, df = 18349, p-value = 0.09559
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.208692 14.942321
## sample estimates:
## mean of x mean of y 
##  113.6964  106.8296

Since all of the numerical variables had high p-values (above the standard alpha of 0.05), this means that we do not reject the null hypothesis, meaning the sample averages are equal. Next, we can test the binary categorical variables with a proportion Z-test, where the null hypothesis is that the sample proportions are equal.

# Has romance
prop.test(x = c(sum(train$Has_romance),sum(test$Has_romance)), n = c(nrow(train),nrow(test)))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(sum(train$Has_romance), sum(test$Has_romance)) out of c(nrow(train), nrow(test))
## X-squared = 2.0885, df = 1, p-value = 0.1484
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.018423712  0.002695043
## sample estimates:
##    prop 1    prop 2 
## 0.7687189 0.7765832
# Has platonic relationships
prop.test(x = c(sum(train$Has_platonic),sum(test$Has_platonic)), n = c(nrow(train),nrow(test)))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(sum(train$Has_platonic), sum(test$Has_platonic)) out of c(nrow(train), nrow(test))
## X-squared = 1.7461, df = 1, p-value = 0.1864
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.019528291  0.003808339
## sample estimates:
##    prop 1    prop 2 
## 0.3019604 0.3098204
# Is crossover
prop.test(x = c(sum(train$Is_crossover),sum(test$Is_crossover)), n = c(nrow(train),nrow(test)))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(sum(train$Is_crossover), sum(test$Is_crossover)) out of c(nrow(train), nrow(test))
## X-squared = 7.0022, df = 1, p-value = 0.008141
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.003928476 0.025506753
## sample estimates:
##    prop 1    prop 2 
## 0.2512457 0.2365281
# Top five tags
prop.test(x = c(sum(train$Five_tags),sum(test$Five_tags)), n = c(nrow(train),nrow(test)))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(sum(train$Five_tags), sum(test$Five_tags)) out of c(nrow(train), nrow(test))
## X-squared = 0.022982, df = 1, p-value = 0.8795
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.01366994  0.01156682
## sample estimates:
##    prop 1    prop 2 
## 0.4797404 0.4807919

For the binary categorical variables, all had high p-values, so the sample proportions are also equal. For ‘Rating’, I have to do some different steps to determine the proportions.

unique(ao3$Rating)
## [1] Not Rated             Mature                Teen And Up Audiences
## [4] Explicit              General Audiences    
## 5 Levels: Explicit General Audiences Mature ... Teen And Up Audiences
# make proportion table to make sure test values are correct
prop.table(table(train$Rating))
## 
##              Explicit     General Audiences                Mature 
##             0.2053173             0.2189877             0.1706989 
##             Not Rated Teen And Up Audiences 
##             0.1121820             0.2928141
prop.table(table(test$Rating))
## 
##              Explicit     General Audiences                Mature 
##             0.2116166             0.2138455             0.1735938 
##             Not Rated Teen And Up Audiences 
##             0.1079061             0.2930379
# test each sub category individually
# Explicit
prop.test(x = c(table(train$Rating)[[1]], table(test$Rating)[[1]]), n = c(nrow(train),nrow(test)))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(table(train$Rating)[[1]], table(test$Rating)[[1]]) out of c(nrow(train), nrow(test))
## X-squared = 1.439, df = 1, p-value = 0.2303
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.016607507  0.004008927
## sample estimates:
##    prop 1    prop 2 
## 0.2053173 0.2116166
# General Audiences
prop.test(x = c(table(train$Rating)[[2]], table(test$Rating)[[2]]), n = c(nrow(train),nrow(test)))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(table(train$Rating)[[2]], table(test$Rating)[[2]]) out of c(nrow(train), nrow(test))
## X-squared = 0.91656, df = 1, p-value = 0.3384
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.005245776  0.015530026
## sample estimates:
##    prop 1    prop 2 
## 0.2189877 0.2138455
# Mature
prop.test(x = c(table(train$Rating)[[3]], table(test$Rating)[[3]]), n = c(nrow(train),nrow(test)))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(table(train$Rating)[[3]], table(test$Rating)[[3]]) out of c(nrow(train), nrow(test))
## X-squared = 0.34013, df = 1, p-value = 0.5598
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.012468018  0.006678244
## sample estimates:
##    prop 1    prop 2 
## 0.1706989 0.1735938
# Not Rated
prop.test(x = c(table(train$Rating)[[4]], table(test$Rating)[[4]]), n = c(nrow(train),nrow(test)))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(table(train$Rating)[[4]], table(test$Rating)[[4]]) out of c(nrow(train), nrow(test))
## X-squared = 1.0848, df = 1, p-value = 0.2976
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.003618017  0.012169789
## sample estimates:
##    prop 1    prop 2 
## 0.1121820 0.1079061
# Teen and Up Audiences
prop.test(x = c(table(train$Rating)[[5]], table(test$Rating)[[5]]), n = c(nrow(train),nrow(test)))
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(table(train$Rating)[[5]], table(test$Rating)[[5]]) out of c(nrow(train), nrow(test))
## X-squared = 0.00059313, df = 1, p-value = 0.9806
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.01172595  0.01127828
## sample estimates:
##    prop 1    prop 2 
## 0.2928141 0.2930379

To use the proportion test, I have to test each subcategory individually. When I tested each one, all had a p-value above alpha, meaning that the proportions between the test and train sets are roughly equal.

Testing Models

Now that we know that the data is evenly split, we can finally run the first regression model and step how well the model fits the data. In this first model, I will include all the variables that were significant without nonlinear terms or interactions.

# install to determine MAPE value
library(Metrics)
model <- lm(Num_kudos ~ Num_hits + Num_comments + Rating + Five_tags + Has_romance + Has_platonic + Is_crossover, data = train)
summary(model)
## 
## Call:
## lm(formula = Num_kudos ~ Num_hits + Num_comments + Rating + Five_tags + 
##     Has_romance + Has_platonic + Is_crossover, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17947.4    -44.2    -24.5      3.0   6200.2 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 20.9893916  4.0316844   5.206 1.94e-07 ***
## Num_hits                     0.0217422  0.0001129 192.554  < 2e-16 ***
## Num_comments                 0.3175938  0.0087199  36.422  < 2e-16 ***
## RatingGeneral Audiences     -7.6627563  3.8540185  -1.988 0.046793 *  
## RatingMature                 0.8943404  3.9735311   0.225 0.821923    
## RatingNot Rated             -6.0500290  4.5369655  -1.333 0.182379    
## RatingTeen And Up Audiences  4.8698072  3.5954793   1.354 0.175611    
## Five_tagsTRUE               22.9967938  2.4672290   9.321  < 2e-16 ***
## Has_romanceTRUE             10.4122037  3.0322412   3.434 0.000596 ***
## Has_platonicTRUE            24.7366235  2.7403003   9.027  < 2e-16 ***
## Is_crossoverTRUE             2.7695445  2.7807791   0.996 0.319278    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 209.7 on 30493 degrees of freedom
## Multiple R-squared:  0.7794, Adjusted R-squared:  0.7793 
## F-statistic: 1.077e+04 on 10 and 30493 DF,  p-value: < 2.2e-16
y_pred <- predict(model, newdata=test)
mape(test$Num_kudos, y_pred)
## [1] 4.323554
AIC(model)
## [1] 412697.7

Although all of the variables were significant when they were univariate equations, ‘Is_crossover’ and ‘Rating’ (minus General Audiences) no longer was significant. But, the model itself has a high R-squared value of 0.7793, meaning 77.93% of the data is explained by this model. When I calculated the MAPE, the value was 4.32%. With MAPE, this means that the average deviation between the predicted and actual values was 4.32%. The threshold of a very good MAPE score is below 10%, though a model is still considered good with a MAPE below 20%. If we were not considered nonlinear and interaction terms, I would next remove the non-significant variables and see how that changes the model.

Next, I will run the model with the nonlinear and interaction terms to see if those added variables help with the overall model.

model1 <- lm(Num_kudos ~ Num_hits + hits2 + Num_comments + comments2 + Num_hits*Num_comments + Rating + Five_tags + Has_romance + Has_platonic + Is_crossover + Rating*Five_tags + Rating*Has_romance + Five_tags*Has_romance, data = train)
summary(model1)
## 
## Call:
## lm(formula = Num_kudos ~ Num_hits + hits2 + Num_comments + comments2 + 
##     Num_hits * Num_comments + Rating + Five_tags + Has_romance + 
##     Has_platonic + Is_crossover + Rating * Five_tags + Rating * 
##     Has_romance + Five_tags * Has_romance, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -991.9  -24.1   -9.7    8.8 3377.5 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                 -5.351e+01  5.040e+00 -10.617
## Num_hits                                     4.649e-02  4.049e-04 114.809
## hits2                                       -6.375e-07  1.471e-08 -43.343
## Num_comments                                 1.863e-01  1.899e-02   9.808
## comments2                                   -5.540e-04  1.810e-05 -30.615
## RatingGeneral Audiences                      6.506e+01  5.478e+00  11.877
## RatingMature                                 5.695e+01  6.175e+00   9.223
## RatingNot Rated                              6.156e+01  6.001e+00  10.259
## RatingTeen And Up Audiences                  6.703e+01  5.463e+00  12.270
## Five_tagsTRUE                                1.146e+01  3.748e+00   3.058
## Has_romanceTRUE                              5.848e+01  5.180e+00  11.289
## Has_platonicTRUE                             9.621e+00  1.272e+00   7.565
## Is_crossoverTRUE                            -3.952e+00  1.282e+00  -3.083
## Num_hits:Num_comments                        2.503e-05  1.004e-06  24.924
## RatingGeneral Audiences:Five_tagsTRUE        1.067e+01  3.618e+00   2.951
## RatingMature:Five_tagsTRUE                   2.889e+00  3.732e+00   0.774
## RatingNot Rated:Five_tagsTRUE                9.357e+00  4.261e+00   2.196
## RatingTeen And Up Audiences:Five_tagsTRUE    7.895e+00  3.366e+00   2.346
## RatingGeneral Audiences:Has_romanceTRUE     -5.180e+01  5.784e+00  -8.955
## RatingMature:Has_romanceTRUE                -5.706e+01  6.421e+00  -8.886
## RatingNot Rated:Has_romanceTRUE             -5.708e+01  6.370e+00  -8.960
## RatingTeen And Up Audiences:Has_romanceTRUE -5.119e+01  5.701e+00  -8.980
## Five_tagsTRUE:Has_romanceTRUE               -1.110e+01  2.812e+00  -3.948
##                                             Pr(>|t|)    
## (Intercept)                                  < 2e-16 ***
## Num_hits                                     < 2e-16 ***
## hits2                                        < 2e-16 ***
## Num_comments                                 < 2e-16 ***
## comments2                                    < 2e-16 ***
## RatingGeneral Audiences                      < 2e-16 ***
## RatingMature                                 < 2e-16 ***
## RatingNot Rated                              < 2e-16 ***
## RatingTeen And Up Audiences                  < 2e-16 ***
## Five_tagsTRUE                                0.00223 ** 
## Has_romanceTRUE                              < 2e-16 ***
## Has_platonicTRUE                            4.00e-14 ***
## Is_crossoverTRUE                             0.00205 ** 
## Num_hits:Num_comments                        < 2e-16 ***
## RatingGeneral Audiences:Five_tagsTRUE        0.00317 ** 
## RatingMature:Five_tagsTRUE                   0.43884    
## RatingNot Rated:Five_tagsTRUE                0.02810 *  
## RatingTeen And Up Audiences:Five_tagsTRUE    0.01899 *  
## RatingGeneral Audiences:Has_romanceTRUE      < 2e-16 ***
## RatingMature:Has_romanceTRUE                 < 2e-16 ***
## RatingNot Rated:Has_romanceTRUE              < 2e-16 ***
## RatingTeen And Up Audiences:Has_romanceTRUE  < 2e-16 ***
## Five_tagsTRUE:Has_romanceTRUE               7.91e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 95.99 on 30245 degrees of freedom
##   (236 observations deleted due to missingness)
## Multiple R-squared:  0.7228, Adjusted R-squared:  0.7226 
## F-statistic:  3585 on 22 and 30245 DF,  p-value: < 2.2e-16
y_pred1 <- predict(model1, new_data=test)
mape(test$Num_kudos, y_pred1)
## [1] 8.563739
AIC(model1)
## [1] 362224.9

With the addition of nonlinear terms in the model, we cannot compare the R-squared values between this model and the previous model. Though we can use values like AIC and MAPE to determine if it is a model model compared to the previous model. Although the MAPE is higher than the original model, at 8.56%, it is still considered to be a good MAPE score. With the first model, it had an AIC of 412,697.7 while this model has an AIC of 362,224.9. The general guideline with AIC values to compare models is that the smaller of the two values is the better model, meaning that adding interaction terms and nonlinear terms helped the model to fit the data better.

Since there is a variable in the model where it has a high p-value, which is not significant, I will remove this variable, ’Rating*Five_tags’, and see how this affects the overall model.

model2 <- lm(Num_kudos ~ Num_hits + hits2 + Num_comments + comments2 + Num_hits*Num_comments + Rating + Five_tags + Has_romance + Has_platonic + Is_crossover + Rating*Has_romance + Five_tags*Has_romance, data = train)
summary(model2)
## 
## Call:
## lm(formula = Num_kudos ~ Num_hits + hits2 + Num_comments + comments2 + 
##     Num_hits * Num_comments + Rating + Five_tags + Has_romance + 
##     Has_platonic + Is_crossover + Rating * Has_romance + Five_tags * 
##     Has_romance, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -989.5  -23.9  -10.1    8.8 3374.9 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                 -5.483e+01  5.019e+00 -10.924
## Num_hits                                     4.650e-02  4.049e-04 114.824
## hits2                                       -6.374e-07  1.471e-08 -43.333
## Num_comments                                 1.832e-01  1.897e-02   9.659
## comments2                                   -5.519e-04  1.809e-05 -30.515
## RatingGeneral Audiences                      6.736e+01  5.403e+00  12.467
## RatingMature                                 5.643e+01  6.087e+00   9.271
## RatingNot Rated                              6.333e+01  5.881e+00  10.767
## RatingTeen And Up Audiences                  6.828e+01  5.391e+00  12.665
## Five_tagsTRUE                                1.975e+01  2.400e+00   8.227
## Has_romanceTRUE                              5.777e+01  5.169e+00  11.176
## Has_platonicTRUE                             9.568e+00  1.272e+00   7.523
## Is_crossoverTRUE                            -4.036e+00  1.282e+00  -3.149
## Num_hits:Num_comments                        2.500e-05  1.004e-06  24.892
## RatingGeneral Audiences:Has_romanceTRUE     -4.921e+01  5.723e+00  -8.599
## RatingMature:Has_romanceTRUE                -5.592e+01  6.377e+00  -8.770
## RatingNot Rated:Has_romanceTRUE             -5.489e+01  6.314e+00  -8.693
## RatingTeen And Up Audiences:Has_romanceTRUE -4.911e+01  5.653e+00  -8.687
## Five_tagsTRUE:Has_romanceTRUE               -1.363e+01  2.708e+00  -5.031
##                                             Pr(>|t|)    
## (Intercept)                                  < 2e-16 ***
## Num_hits                                     < 2e-16 ***
## hits2                                        < 2e-16 ***
## Num_comments                                 < 2e-16 ***
## comments2                                    < 2e-16 ***
## RatingGeneral Audiences                      < 2e-16 ***
## RatingMature                                 < 2e-16 ***
## RatingNot Rated                              < 2e-16 ***
## RatingTeen And Up Audiences                  < 2e-16 ***
## Five_tagsTRUE                                < 2e-16 ***
## Has_romanceTRUE                              < 2e-16 ***
## Has_platonicTRUE                            5.49e-14 ***
## Is_crossoverTRUE                             0.00164 ** 
## Num_hits:Num_comments                        < 2e-16 ***
## RatingGeneral Audiences:Has_romanceTRUE      < 2e-16 ***
## RatingMature:Has_romanceTRUE                 < 2e-16 ***
## RatingNot Rated:Has_romanceTRUE              < 2e-16 ***
## RatingTeen And Up Audiences:Has_romanceTRUE  < 2e-16 ***
## Five_tagsTRUE:Has_romanceTRUE               4.91e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 96.01 on 30249 degrees of freedom
##   (236 observations deleted due to missingness)
## Multiple R-squared:  0.7227, Adjusted R-squared:  0.7225 
## F-statistic:  4380 on 18 and 30249 DF,  p-value: < 2.2e-16
y_pred2 <- predict(model2, new_data=test)
mape(test$Num_kudos, y_pred2)
## [1] 8.564222
AIC(model2)
## [1] 362228.4

With this model, the R-squared between this one and the previous one are almost the same, and this is the case with the MAPE score, too. Although the AIC is slightly higher than the previous model, all of the variables are significant (below alpha) in this model; therefore, I believe that it is a better model compared to the other nonlinear, interaction term model.

Since this model has a high R-squared, a low MAPE, and significant variables, this model is good model to use to predict positive fan interaction (or likes). Both views (Num_hits) and comments (Num_comments) have positive coefficients, meaning that for every view or comment left, this will increase the number of likes. Though, with their squared terms, the coefficients are negative. The interaction term between the two numerical variables, on the other hand, has a positive coefficient, meaning that viewing and leaving a comment will increase the number of likes. For the rating, the reference was the ‘Explicit’ rating, so when comparing the other ratings to this one, all of them had positive coefficients, meaning if the fanfiction was rated something other than ‘Explicit’ it would gain more likes. For example, if a fanfiction was rated ‘General Audiences’, it would receive approximately 67.36 more likes rather than if it was rated ‘Explicit’. For companies and authors, this means that, generally, people tend to like things that a rated differently from ‘Explicit’, specifically, in this case, the rating ‘Teen and Up Audiences’ is the most preferred (liked) rating compared to ‘Explicit’.

For the Boolean variables, ‘Has_romance’ and ‘Has_platonic’ have a positive coefficient, meaning that including romantic relationships and platonic relationships in one’s book, movie, or show will increase the overall positive reception of the piece of media. ‘Five_tags’ also received a positive coefficient, so including ‘Fluff’, ‘Angst’, ‘Hurt/comfort’, ‘Canon Divergence’, or ‘Fluff and Angst’ would overall increase the positive fan interaction. Unsurprisingly, having a crossover between two pieces of media decreases the overall positive fan interaction, meaning fans tend to enjoy the original story and plot of a book or movie rather than seeing the plot and characters being mixed into another piece of media. The interaction variable for rating and romance have negative coefficients for all of the possible ratings, meaning that if romance is included, the best number of likes would be achieved by having the piece of media rated ‘Explicit’. The last interaction term is the top five tags with romance – this one had some surprising results. If a piece of media contains any of the top five tags and included a romantic relationship, it would decrease the overall number of likes.

Test for Heteroskedasticity

A good way to determine heteroskedasticity in the model is to test with the Breusch-Pagan Lagrange Multiplier Test.

library(lmtest)
bptest(model2)
## 
##  studentized Breusch-Pagan test
## 
## data:  model2
## BP = 3443.4, df = 18, p-value < 2.2e-16

The null hypothesis of this test is that the model is homoskedastic. Since we reject the null hypothesis, this means that the model is heteroskedastic. Below, I plot the fitted values with the residual values to look at the points more closely.

residuals <- resid(model2)
fitted.values <- fitted(model2)
plot(fitted.values,residuals) + abline(0,0)

## integer(0)

Looking at the plot above, one can definitely see the distinct fanning that tends to be shown with heteroskedasticity with a large dispersion of points.

With heteroskedasticity, there are a couple of ways to deal with this problem, including transforming the response variable and using weighted regression. If one tries to take the log or square-root of the dependent variable, then this may cause the heteroskedasticity to disappear. Additionally, with weighed regression, which assigns a weight to each data point, can give smaller weights to data points with higher variance, which may decrease their squared-residuals, so if the right weights are used, one could eliminate heteroskedasticity in the model.

Test for Normality

To see if the residuals are normally distributed, one can plot a Q-Q plot. If the points are normally distributed, the data plot will follow the line.

# a lot pretty qqplot in this library
library(car)
qqPlot(residuals)

##    45    37 
## 16022 13570

Once it passes the -2 and 2 quantiles the data points tend to stray away from the line, but in between those quantiles, they follow the line quite closely. Another way to visually see if the residuals are normally distributed is to look at a density plot of the residuals.

plot(density(residuals))

Seeing the density plot, most of the data is hovered around the zero marker, but there seems to be a right-skew with the data. Additionally, there seems to be some residual values that are much smaller or larger than zero, meaning there may be a problem with outliers.

After looking at both of the visualizations, one can see that the residuals are not normal, but to make sure of that, I will run a Jarque-Bera test, where the null hypothesis is that the data is normally distributed, and I will use the residuals to see if the residuals are normally distributed.

library(tseries)
# test for normality of residuals
jarque.bera.test(residuals)
## 
##  Jarque Bera Test
## 
## data:  residuals
## X-squared = 19289804, df = 2, p-value < 2.2e-16

As one can see above, the p-value is below alpha, meaning that the residuals are not normally distributed.

Fixing the Model

Similar to fixing heteroskedasticity, a way to fix the residuals to hopefully make them normally distributed is to transform the dependent variable. So, to create a better model, I used Box-Cox transformation to determine the best type of transformation for the response variable.

library(MASS)
boxcox(lm(Num_kudos ~ Num_hits + hits2 + Num_comments + comments2 + Num_hits*Num_comments + Rating + Five_tags + Has_romance + Has_platonic + Is_crossover + Rating*Has_romance + Five_tags*Has_romance, data=train))

From the plot above, it has been determined that a log transformation of the dependent variable would best normalize the data.

model3 <- lm(log(Num_kudos) ~ Num_hits + hits2 + Num_comments + comments2 + Num_hits*Num_comments + Rating + Five_tags + Has_romance + Has_platonic + Is_crossover + Rating*Has_romance + Five_tags*Has_romance, data = train)
summary(model3)
## 
## Call:
## lm(formula = log(Num_kudos) ~ Num_hits + hits2 + Num_comments + 
##     comments2 + Num_hits * Num_comments + Rating + Five_tags + 
##     Has_romance + Has_platonic + Is_crossover + Rating * Has_romance + 
##     Five_tags * Has_romance, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5246 -0.7318  0.1274  0.8315  5.2673 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                  1.622e+00  5.904e-02  27.470
## Num_hits                                     4.802e-04  4.763e-06 100.828
## hits2                                       -1.092e-08  1.730e-10 -63.107
## Num_comments                                 6.728e-04  2.231e-04   3.015
## comments2                                   -6.841e-07  2.127e-07  -3.216
## RatingGeneral Audiences                      4.101e-01  6.355e-02   6.453
## RatingMature                                 3.455e-01  7.159e-02   4.826
## RatingNot Rated                              1.905e-01  6.917e-02   2.754
## RatingTeen And Up Audiences                  4.549e-01  6.341e-02   7.175
## Five_tagsTRUE                                6.225e-01  2.823e-02  22.047
## Has_romanceTRUE                              1.174e+00  6.079e-02  19.310
## Has_platonicTRUE                             3.021e-01  1.496e-02  20.195
## Is_crossoverTRUE                            -6.694e-02  1.508e-02  -4.439
## Num_hits:Num_comments                        1.747e-08  1.181e-08   1.479
## RatingGeneral Audiences:Has_romanceTRUE     -6.127e-01  6.731e-02  -9.102
## RatingMature:Has_romanceTRUE                -5.203e-01  7.500e-02  -6.938
## RatingNot Rated:Has_romanceTRUE             -3.924e-01  7.427e-02  -5.284
## RatingTeen And Up Audiences:Has_romanceTRUE -5.605e-01  6.649e-02  -8.429
## Five_tagsTRUE:Has_romanceTRUE               -3.824e-01  3.186e-02 -12.006
##                                             Pr(>|t|)    
## (Intercept)                                  < 2e-16 ***
## Num_hits                                     < 2e-16 ***
## hits2                                        < 2e-16 ***
## Num_comments                                 0.00257 ** 
## comments2                                    0.00130 ** 
## RatingGeneral Audiences                     1.11e-10 ***
## RatingMature                                1.40e-06 ***
## RatingNot Rated                              0.00589 ** 
## RatingTeen And Up Audiences                 7.41e-13 ***
## Five_tagsTRUE                                < 2e-16 ***
## Has_romanceTRUE                              < 2e-16 ***
## Has_platonicTRUE                             < 2e-16 ***
## Is_crossoverTRUE                            9.06e-06 ***
## Num_hits:Num_comments                        0.13926    
## RatingGeneral Audiences:Has_romanceTRUE      < 2e-16 ***
## RatingMature:Has_romanceTRUE                4.06e-12 ***
## RatingNot Rated:Has_romanceTRUE             1.28e-07 ***
## RatingTeen And Up Audiences:Has_romanceTRUE  < 2e-16 ***
## Five_tagsTRUE:Has_romanceTRUE                < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.129 on 30249 degrees of freedom
##   (236 observations deleted due to missingness)
## Multiple R-squared:  0.4995, Adjusted R-squared:  0.4992 
## F-statistic:  1677 on 18 and 30249 DF,  p-value: < 2.2e-16
y_pred3 <- predict(model3, new_data=test)
mape(test$Num_kudos, y_pred3)
## [1] 0.8411268
AIC(model3)
## [1] 93274.18

While the AIC decreased, meaning it is a better model, and the MAPE decreased too, even though the R-squared should not be compared across models, the abrupt drop still is not the nicest. Also, since ‘Num_kudos’ has shifted with a log transformation, I will plot the variable and take its correlation to see if the numerical variable, ‘Num_hits’ would also benefit from a log transformation (more so than a squared transformation). I cannot test for ‘Num_comments’ because there are zero values in the data set, and log(0) equals infinity, so it cannot have a log transformation.

# only target variable log transformation
ggplot(ao3, aes(x=Num_hits, y=log(Num_kudos))) +
    geom_point() +
    geom_smooth(method=lm)

cor(ao3$Num_hits,log(ao3$Num_kudos))
## [1] 0.3179323
# both log transformation
ggplot(ao3, aes(x=log(Num_hits), y=log(Num_kudos))) +
    geom_point() +
    geom_smooth(method=lm)

cor(log(ao3$Num_hits),log(ao3$Num_kudos))
## [1] 0.899418

As seen above, we must also transform the ‘Num_hits’ variable to fit the model better, so instead of the squared transformation from before, it will be a log transformation.

model4 <- lm(log(Num_kudos) ~ log(Num_hits) + Num_comments + comments2 + Rating + Five_tags + Has_romance + Has_platonic + Is_crossover + Num_hits*Num_comments + Rating*Has_romance + Five_tags*Has_romance, data = train)
summary(model4)
## 
## Call:
## lm(formula = log(Num_kudos) ~ log(Num_hits) + Num_comments + 
##     comments2 + Rating + Five_tags + Has_romance + Has_platonic + 
##     Is_crossover + Num_hits * Num_comments + Rating * Has_romance + 
##     Five_tags * Has_romance, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1323 -0.3897  0.0726  0.4716  2.7846 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                 -3.072e+00  3.890e-02 -78.979
## log(Num_hits)                                8.819e-01  2.805e-03 314.426
## Num_comments                                 1.820e-04  4.757e-05   3.827
## comments2                                   -2.753e-08  7.208e-09  -3.820
## RatingGeneral Audiences                      1.298e+00  3.814e-02  34.037
## RatingMature                                 9.071e-01  4.281e-02  21.188
## RatingNot Rated                              1.036e+00  4.147e-02  24.995
## RatingTeen And Up Audiences                  1.143e+00  3.796e-02  30.107
## Five_tagsTRUE                                1.794e-01  1.695e-02  10.584
## Has_romanceTRUE                              8.220e-01  3.633e-02  22.628
## Has_platonicTRUE                             6.710e-02  8.982e-03   7.471
## Is_crossoverTRUE                            -1.208e-01  8.998e-03 -13.427
## Num_hits                                    -4.764e-06  5.805e-07  -8.207
## Num_comments:Num_hits                        7.064e-10  1.200e-10   5.889
## RatingGeneral Audiences:Has_romanceTRUE     -8.892e-01  4.021e-02 -22.111
## RatingMature:Has_romanceTRUE                -8.166e-01  4.480e-02 -18.228
## RatingNot Rated:Has_romanceTRUE             -8.579e-01  4.439e-02 -19.328
## RatingTeen And Up Audiences:Has_romanceTRUE -8.171e-01  3.971e-02 -20.577
## Five_tagsTRUE:Has_romanceTRUE               -7.780e-02  1.904e-02  -4.086
##                                             Pr(>|t|)    
## (Intercept)                                  < 2e-16 ***
## log(Num_hits)                                < 2e-16 ***
## Num_comments                                0.000130 ***
## comments2                                   0.000134 ***
## RatingGeneral Audiences                      < 2e-16 ***
## RatingMature                                 < 2e-16 ***
## RatingNot Rated                              < 2e-16 ***
## RatingTeen And Up Audiences                  < 2e-16 ***
## Five_tagsTRUE                                < 2e-16 ***
## Has_romanceTRUE                              < 2e-16 ***
## Has_platonicTRUE                            8.17e-14 ***
## Is_crossoverTRUE                             < 2e-16 ***
## Num_hits                                    2.36e-16 ***
## Num_comments:Num_hits                       3.93e-09 ***
## RatingGeneral Audiences:Has_romanceTRUE      < 2e-16 ***
## RatingMature:Has_romanceTRUE                 < 2e-16 ***
## RatingNot Rated:Has_romanceTRUE              < 2e-16 ***
## RatingTeen And Up Audiences:Has_romanceTRUE  < 2e-16 ***
## Five_tagsTRUE:Has_romanceTRUE               4.40e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6769 on 30485 degrees of freedom
## Multiple R-squared:  0.8292, Adjusted R-squared:  0.8291 
## F-statistic:  8220 on 18 and 30485 DF,  p-value: < 2.2e-16
y_pred4 <- predict(model4, new_data=test)
mape(test$Num_kudos, y_pred4)
## [1] 0.8693914
AIC(model4)
## [1] 62777.98

With this new model, while it does have a slightly higher MAPE (though it is almost negligible), the AIC of the new model is much lower than any other model that has been created so far. In this model, log ‘Num_hits’, ‘Num_comments’, ‘Rating’, ‘Five_tags’, ‘Has_romance’, and ‘Has_platonic’ all had positive coefficients, meaning that these will increase the amount of positive fan interaction (log number of likes) for a book, movie, or show. Comments squared, ‘Is_crossover’, and normal ‘Num_hits’ have negative coefficients, so it will overall decrease the positive fan interaction.

With the interaction variables, as the number of views (‘Num_hits’) and comments increase, according to the coefficient given in the summary table, the positive fan interaction increases. With ‘Rating’, the reference is ‘Explicit’, so this means with the non-interaction term, a fanfiction performs better if it is rating something other than ‘Explicit’. ‘Rating’ with ‘Has_romance’ uses ‘Explicit’ as a reference again, and if romance is involved, if the rating is not ‘Explicit’, it actually decreases the overall positive fan interaction. ‘Five_tags’ with ‘Has_romance’ has a negative coefficient as well, meaning if any top five tag is included in a story with a romantic relationship, this would decrease the overall fan interaction.

Now, I check the heteroskedasticity and normality like what I did earlier with a previous model.

residuals_m4 <- resid(model4)
fitted.values_m4 <- fitted(model4)
# run heteroskedasticity test again
bptest(model4)
## 
##  studentized Breusch-Pagan test
## 
## data:  model4
## BP = 540.3, df = 18, p-value < 2.2e-16
# plot to see heteroskedasticity
plot(fitted.values_m4,residuals_m4) + abline(0,0)

## integer(0)

Unfortunately, it is still heteroskedastic according to the Breusch-Pagan test, but the plot’s residuals are a lot smaller and tend to be clumped together – there are not as many outliers in the plot, so compared to the previous plot, it is less heteroskedastic. Next, we use a Q-Q plot and see the density of the residuals to see if it is normally distributed.

# see if residuals are normal
qqPlot(residuals_m4)

## 38111 36743 
## 18869 10917
# another visual to see if residuals are normal
plot(density(residuals_m4))

# normal test
jarque.bera.test(residuals_m4)
## 
##  Jarque Bera Test
## 
## data:  residuals_m4
## X-squared = 3233.9, df = 2, p-value < 2.2e-16

The Q-Q plot is much best to the previous plot though the tails of the points are not as linear, but it is more linear compared to the previous plot. Also, the density plot does look mostly normally distributed, but there is a bit of a left-skew, making it not as normal. When I do the Jarque-Bera test, it still rejects the null hypothesis, meaning the residuals are not normally distributed. Overall, this new model is better than the previous model, but it does still have issues with heteroskedasticity and normality of the residuals.

Weaknesses

When analyzing the model, I found that there was a huge problem with outliers that should have been handled early in the data. If I choose to scrape data that was more fan-community specific, such as Harry Potter, the values would likely not be as extreme because the data set includes data for communities that do not have a lot of fan interaction due to age or being cult classics like the movie Event Horizon, having only a total of a little over 25 fanfictions ever created, and communities with wide spread popularity, such as The Avengers with over 215,000 fanfictions, having tons a fans interact with that content.

Additionally,despite my attempts and transforming this model, I could not make the errors and residuals homoskedastic and normally distributed, meaning that a weighed regression may need to be used to fix those issues. Though, the log transformation did help the heteroskedasticity and non-normality a bit, which is shown in the plots above.

Conclusion

AO3 is a website containing tons of stories that fans interact with every day by either writing or reading. When fans look for more content for a series, after checking officially released options, they tend to go to this site. With its comprehensive tagging system, it is consider to be the best of the fanfiction sites out there, and the site does really well with providing data. Because of this, it is a great option to use to determine what tags, ratings, or relationships would best help a series if they wished to see what fans interacted with the most. But, it can also give insight to the broader view of what is generally liked as well if one wishes to create a new book or movie.

With this model, one can predict the amount of positive fan interaction (likes) that a book, movie, or show may have given certain variables, such as number of views, what rating it has, types of relationships (if there are any), and different tags (different broad tropes) used. This can be beneficial for media companies, groups, or individuals. For example, it can help an author determine what tropes, ratings, and relationships to have if they are writing a new book, and this can help to determine if its broad ideas can gain traction. Additionally, this can be beneficial for companies who may be creating a sequel to a series, where if they tailor the data to be more genre or series specific, they can see how well the fan community will receive an idea.

By helping to determine what gains positive fan interactions, this can help with the retention of current fans of a series while it can also help to bring in new viewers/readers. Since I used a data set consisting of all possible medias of books, shows, or movies, I create a model that would be beneficial in determining the best broad idea for a general audience. If one were to web scrape more specific data, such as only scraping The Reckoners series content, they one can find what is best for for that series specifically.

Therefore, if the media company or individual author wishes to determine the amount of positive fan interaction, one can use the model to at least figure out some of the more broader aspects that should be in the piece of media. One can determine whether romantic or platonic relationships should be include (and be prominent) or whether one should add key concepts with the top five tags given in the works, such as including cute and happy content (fluff) or including sadder content (angst). Hopefully, it will help to attract both veteran fans and newcomers into the series by including concepts that were generally well-received by using a website that thrives from fan interaction.